<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <p>Yeah, so this page shows that C++11 regex is still mostly

      unsupported in gcc:</p>

    <p><a class="moz-txt-link-freetext" href="http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1">http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1</a></p>

    <p>(see section 7)</p>

    <p>And I don't think the old-school GNU regex we use otherwise knows

      anything about wide chars.  It simply compares bytes and doesn't

      have a clue whether some of them should be considered part of the

      same character.  I suspect that's because nowhere do we tell it

      that we're giving it UTF-8.</p>
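    <p>Here is a quick way to see the byte-versus-character problem. This
      is a Python sketch, not anything from our code, and the helper names
      <code>matches_single_dot</code> and <code>count_dot_matches</code>
      are made up for illustration. It assumes the character in question
      is the en dash (U+2013), which UTF-8 encodes as three bytes. A regex
      run over raw bytes treats each byte as its own "character", while
      one run over decoded text sees a single character:</p>

```python
import re

EN_DASH = "\u2013"                      # the en dash discussed in this thread
utf8_bytes = EN_DASH.encode("utf-8")    # b"\xe2\x80\x93", three bytes


def matches_single_dot(subject):
    """Return True if the whole subject is matched by a single '.'."""
    # Python's re module will not mix bytes and str, so pick the pattern
    # type that matches the subject type.
    pattern = rb"." if isinstance(subject, bytes) else r"."
    return re.fullmatch(pattern, subject) is not None


def count_dot_matches(subject):
    """Count how many separate '.' matches the engine finds in subject."""
    pattern = rb"." if isinstance(subject, bytes) else r"."
    return len(re.findall(pattern, subject))


print(matches_single_dot(utf8_bytes))   # byte-oriented matching: False
print(matches_single_dot(EN_DASH))      # character-aware matching: True
print(count_dot_matches(utf8_bytes))    # the bytes look like 3 "characters"
```

    <p>That is roughly what I think happens to us: an engine that is
      never told its input is UTF-8 sees three unrelated bytes where we
      see one en dash.</p>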

    <p>Ultimately my hope is that gcc will eventually improve and solve

      our problem for us.<br>

    </p>

    <p>We could add an option to use ICU RegexMatcher, but I'm still

      holding out for our compiler.</p>

    <p>Troy<br>

    </p>

    <br>

    <div class="moz-cite-prefix">On 03/06/2017 05:52 PM, Karl Kleinpaste

      wrote:<br>

    </div>

    <blockquote

      cite="mid:315400d9-b73c-14b4-f601-3bc406117969@kleinpaste.org"

      type="cite">


      <div class="moz-cite-prefix">On 03/06/2017 05:25 PM, Greg Hellings

        wrote:<br>

      </div>

      <blockquote

cite="mid:CAHxvOV+bn4WpfAmUPQmarZZz-y3tu7tqLyPMOLA2h4t+OdFo8w@mail.gmail.com"

        type="cite">being off by 2 would seem strange to me</blockquote>

      <font face="FreeSerif">I don't understand this question at all.<br>

        <br>

        0xE2 = 226 = 0342<br>

        0x80 = 128 = 0200<br>

        0x93 = 147 = 0223<br>

        <br>

        There's no off-by error at all.<br>

        <br>

        "od" is the "octal dump" tool; given -c, it tries to dump

        characters, but outside 7-bit ASCII, it still dumps octal.<br>

        <br>

        For those familiar with dc(1), this will make sense<br>

        $ dc<br>

        8o<br>

        226p<br>

        342<br>

        128p<br>

        200<br>

        147p<br>

        223<br>

        16i<br>

        0XE2p<br>

        342<br>

        0X80p<br>

        200<br>

        0X93p<br>

        223<br>

        <br>
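        The same three conversions can be cross-checked outside dc. This
        is a minimal Python sketch, assuming the bytes in question are the
        UTF-8 encoding of U+2013 (en dash); <code>radix_row</code> is an
        illustrative helper name, not part of any tool mentioned here.<br>

```python
EN_DASH_UTF8 = "\u2013".encode("utf-8")   # en dash as UTF-8 bytes


def radix_row(byte):
    """Format one byte value as hex, decimal and octal."""
    return "0x{:02X} = {} = 0{:o}".format(byte, byte, byte)


rows = [radix_row(b) for b in EN_DASH_UTF8]
for row in rows:
    print(row)
# prints:
# 0xE2 = 226 = 0342
# 0x80 = 128 = 0200
# 0x93 = 147 = 0223
```

        <br>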

        The interesting questions are why C++11 regex can't find <i>en

          dash</i>, and why non-C++11 regex doesn't understand

        multibyte.<br>

      </font> <br>


    </blockquote>

    <br>

  </body>

</html>