<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <br>
    <br>
    <div class="moz-cite-prefix">On 14/05/2019 22:55, Cyrille wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:b38b1e01-b8f5-5d73-5a1b-e356dd5a9698@gmail.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <br>
      <br>
      <div class="moz-cite-prefix">On 14/05/2019 22:45, Michael H wrote:<br>
      </div>
      <blockquote type="cite"
cite="mid:CAJ9hia-x+XZnH6UqZkHx5mkQqoagkNrMfGbziBLtup+_BfLhHg@mail.gmail.com">
        <meta http-equiv="content-type" content="text/html;
          charset=UTF-8">
        <div dir="ltr">
          <div class="gmail_default"
            style="font-family:garamond,serif;font-size:large">Cyrille,
            did you start from the PDF or the pagemaker file?</div>
        </div>
      </blockquote>
      PMaker<br>
      <blockquote type="cite"
cite="mid:CAJ9hia-x+XZnH6UqZkHx5mkQqoagkNrMfGbziBLtup+_BfLhHg@mail.gmail.com">
        <div dir="ltr">
          <div class="gmail_default"
            style="font-family:garamond,serif;font-size:large"> Either
            way, you should send a snippet to your source and validate
            that the words are still readable. As few as 30 words should
            be enough. <br>
          </div>
        </div>
      </blockquote>
    </blockquote>
    <div dir="ltr">The converted text? If so, see the attached file.<br>
    </div>
    <blockquote type="cite"
      cite="mid:b38b1e01-b8f5-5d73-5a1b-e356dd5a9698@gmail.com">
      <style type="text/css">pre { font-family: "Liberation Mono", monospace; font-size: 10pt; background: transparent none repeat scroll 0% 0%; }p { margin-bottom: 0.25cm; line-height: 115%; background: transparent none repeat scroll 0% 0%; }</style>
      <blockquote type="cite"
cite="mid:CAJ9hia-x+XZnH6UqZkHx5mkQqoagkNrMfGbziBLtup+_BfLhHg@mail.gmail.com"><br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Tue, May 14, 2019 at 8:09
            AM Cyrille &lt;<a href="mailto:lafricain79@gmail.com"
              moz-do-not-send="true">lafricain79@gmail.com</a>&gt;
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px&#xA;
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div bgcolor="#FFFFFF"> I am sending my message again because
              the first one was too large.<br>
              <br>
              The conversion to UTF-8 is 99% solved! I used an online
              converter:<br>
              <a
                class="gmail-m_2217136186459166179moz-txt-link-freetext"
href="https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html"
                target="_blank" moz-do-not-send="true">https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html</a><br>
              or:<br>
              <a
                class="gmail-m_2217136186459166179moz-txt-link-freetext"
href="http://burglish.my-mm.org/latest/trunk/web/fontconv.htm"
                target="_blank" moz-do-not-send="true">http://burglish.my-mm.org/latest/trunk/web/fontconv.htm</a><br>
              <br>
              See the result <a
href="https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA="
                target="_blank" moz-do-not-send="true">here</a>.<br>
              <br>
              Now the only problem is how to get the verse and chapter
              numbers... <br>
              <br>
              <br>
              <div class="gmail-m_2217136186459166179moz-cite-prefix">On
                14/05/2019 13:53, Michael H wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="ltr">
                    <div dir="ltr">
                      <div class="gmail_default"><font size="4"
                          face="garamond,&#xA; serif">Cyrille, (Peter), <br>
                          <br>
                          Maybe further discussion on this belongs in
                          Gitlab as issues.  Can I get added to this
                          project? <br>
                          <br>
                          Here are the first few lines of Matthew copied
                          from the PDF: </font><br>
                        ------<br>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">&amp;Sifrmaw;OD;
                          {0Ha*vdusrf;</div>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">The
                          Gospel According to Matthew</div>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">ed'gef;</div>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">usr;f
                          ûyy*k Kd¾v f &amp;iS rf maw;O;D \b0rwS wf r;f</div>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">usr;f
                          ûyy*k Kd¾v f &amp;iS rf maw;O;Don f
                          *gavav;,e,rf S*sL;vrl sK;d tmvaf z;O;D
                          \om;jzp\f / (rmu k2;14)</div>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">olonf
                          tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
                          a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm</div>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">av0djzp\f
                          / ool n f wad b;&amp;,d tidk tf e;DwGi f
                          a,Z;lociEf iS ahf wG U Ny;D<br>
                          <br>
                        </div>
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">-----</div>
                        <div class="gmail_default"><font size="4"
                            face="garamond,&#xA; serif">And here are the
                            first few lines of Matthew copied from the
                            Pagemaker file: </font></div>
                        <div class="gmail_default"><font size="4"
                            face="garamond,&#xA; serif">-----<br>
                          </font>
                          <div class="gmail_default"><font size="4"
                              face="garamond, serif">Sifrmaw;OD;
                              {0Ha*vdusrf;</font></div>
                          <div class="gmail_default"><font size="4"
                              face="garamond, serif">The Gospel
                              According to Matthew</font></div>
                          <div class="gmail_default"><span
                              style="font-family:garamond,serif;font-size:large">ed'gef;</span><br>
                          </div>
                          <div class="gmail_default"><span
                              style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf 
                              &amp;Sifrmaw;OD;\b0rSwfwrf;  </span><br>
                          </div>
                          <div class="gmail_default"><span
                              style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf 
                              &amp;Sifrmaw;OD;onf  *gavav;,e,frS
                              *sL;vlrsKd; tmvfaz;OD;\om;jzpf\/ (rmuk
                              2;14) olonf  tcGefcHoltjzpf trIxrf;chJonf/
                              (vk 5;27)
                              a,Zl;ocif\aemufvdkufwynfhrjzpfrD 
                              ol\trnfrSm av0djzpf\/ olonf 
                              wdab;&amp;d,tkdifteD;wGif 
                              a,Zl;ocifESifhawGU  NyD;<br>
                              <br>
                              <br>
                              You can see that some letters have
                              changed, and some others are in a
                              different order. <br>
                              <br>
                            </span><span
                              style="font-family:garamond,serif;font-size:large">The
                              letters that change are likely those
                              points that aren't compatible with
                              unicode, and pagemaker reassigned them to
                              ensure that the file is more widely
                              viewable. Since a conversion is already
                              planned, these won't matter as much, but
                              the font embedded in the PDF is different
                              than the font attached to the pagemaker
                              file. If you do start from the PDF,
                              you'll need to extract the font to get the
                              code points. </span><br
                              style="font-family:garamond,serif;font-size:large">
                            <span
                              style="font-family:garamond,serif;font-size:large"><br>
                              The problem is that the PDF export from
                              Pagemaker sorts the letters into the order
                              they appear on the page.  Burmese text has
                              Indic-style ligatures, where vowels tend
                              to jump over or under the preceding
                              letters, sometimes back two or three
                              letters. If you study the snippets from
                              the beginning of Matthew above, you can
                              see that the order differs and that some
                              glyphs are modified. <br>
                              <br>
                              So, from the PDF, the letters are out of
                              order, but from Pagemaker, some letters are
                              encoded at control code points. Fixing those
                              code points is easy and happens with the
                              unicode conversion.  Fixing the letter
                              order is not easy. You'll need a first
                              language speaker and plenty of time. </span></div>
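                          <div class="gmail_default">To tell the two
                            effects apart (letters in a different order
                            versus letters that were actually changed), a
                            rough check along these lines may help. This is
                            only a sketch: the function name
                            <code>classify_pair</code> is made up for
                            illustration, and it assumes the two snippets
                            have already been lined up word by word.</div>
<pre>
```python
from collections import Counter


def classify_pair(pdf_text, pagemaker_text):
    """Compare one snippet copied from the PDF with the same snippet
    copied from the Pagemaker file, ignoring whitespace."""
    a = "".join(pdf_text.split())
    b = "".join(pagemaker_text.split())
    if a == b:
        return "identical"
    if Counter(a) == Counter(b):
        # Same characters, different sequence: only the order differs.
        return "reordered"
    # At least one character was replaced, added, or dropped.
    return "changed"


# "usr;f" (PDF) and "usrf;" (Pagemaker) are taken from the snippets above.
print(classify_pair("usr;f", "usrf;"))  # prints: reordered
```
</pre>
                          <div class="gmail_default">Pairs reported as
                            "changed" are the ones where a glyph was
                            remapped; pairs reported as "reordered" show
                            the letter-order problem only.</div>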
                          <div class="gmail_default"><span
                              style="font-family:garamond,serif;font-size:large"><br>
                              The guidance I received on another group
                              was to use either LO Draw or Indesign to
                              export the text from Pagemaker.  I'll look
                              into LO Draw again, but I don't have
                              access to an older version of Indesign
                              (the pagemaker import was removed in
                              CS6). </span><span
                              style="font-family:garamond,serif;font-size:large"><br>
                            </span></div>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
                <div dir="ltr">
                  <div class="gmail_default"
                    style="font-family:garamond,serif;font-size:large"><br>
                  </div>
                </div>
                <br>
                <div class="gmail_quote">
                  <div dir="ltr" class="gmail_attr">On Mon, May 13, 2019
                    at 10:40 AM Michael H &lt;<a
                      href="mailto:cmahte@gmail.com" target="_blank"
                      moz-do-not-send="true">cmahte@gmail.com</a>&gt;
                    wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px
                    0px&#xA; 0px 0.8ex;border-left:1px solid&#xA;
                    rgb(204,204,204);padding-left:1ex">
                    <div dir="ltr">
                      <div class="gmail_default"
                        style="font-family:garamond,serif;font-size:large">I
                        unzipped the Pagemaker file, and when I open
                        NT_Proverb/Pagemaker (10.1 MB) with a hex
                        editor, I can 'find' all of the book names and
                        see the text there.  <br>
                        <br>
                        To see the raw text: rename NT_Proverb.pmd to
                        NT_Proverb.zip and open it with a zip archive
                        program.  The text is in the Pagemaker file at
                        the top level of the archive, but encoded with a
                        lot of extraneous information.  (The English
                        text "Matthew" appears at hex location
                        7A76972). <br>
                        <br>
                        When I open the fonts with FontForge, it
                        suggests the fonts are encoded as Unicode (but
                        the glyphs are obviously not in the right
                        places). <br>
                        However, when I copy the text (I copied from LO
                        Draw), paste it into jEdit and save that as
                        Unicode, reopening the file gives the warning
                        'not unicode, text may be missing'. <br>
                        <br>
                        So, what this means is that there are some
                        glyphs encoded into locations that unicode
                        treats as control or non-printing codes. The
                        text needs to be dealt with as a specific
                        encoding that matches whatever the original font
                        actually uses. I haven't figured out what the
                        original text files were encoded with. Without
                        that knowledge, I'm not sure my system clipboard
                        or editor (jedit) will properly respect the
                        glyphs in unusual locations until the conversion
                        to unicode, and I don't trust myself to be able
                        to detect if it is or is not properly
                        converted. <br>
                      </div>
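                      <div class="gmail_default"
                        style="font-family:garamond,serif;font-size:large">The
                        hex-editor search described above can also be
                        scripted. A minimal sketch, assuming (as reported
                        above) that the .pmd file opens as a zip archive;
                        <code>find_in_archive</code> is a hypothetical
                        helper name, and only the Python standard library
                        is used.</div>
<pre>
```python
import zipfile


def find_in_archive(archive, needle):
    """Return {member name: [byte offsets]} for every archive member
    that contains the byte string `needle`.

    `archive` may be a path or a binary file object. Python's zipfile
    looks at the content, not the extension, so no rename is needed.
    """
    hits = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            offsets = []
            pos = data.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = data.find(needle, pos + 1)
            if offsets:
                hits[name] = offsets
    return hits
```
</pre>
                      <div class="gmail_default"
                        style="font-family:garamond,serif;font-size:large">For
                        example, <code>find_in_archive("NT_Proverb.pmd",
                          b"Matthew")</code> would list the offsets of the
                        English book name in each member. If the file is
                        not a zip archive after all, zipfile raises
                        <code>BadZipFile</code>.</div>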
                    </div>
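                    <div class="gmail_default"
                      style="font-family:garamond,serif;font-size:large">One
                      way to see whether copied text contains glyphs
                      sitting at control or non-printing code points is to
                      list them by Unicode general category. This is a
                      sketch only: the set of "suspect" categories and the
                      helper name <code>suspect_code_points</code> are my
                      own assumptions, not something taken from the fonts
                      or from the Pagemaker file.</div>
<pre>
```python
import unicodedata
from collections import Counter

# General categories treated as suspect here (an assumption):
# Cc control, Cf format, Cn unassigned, Co private use, Cs surrogate.
SUSPECT = {"Cc", "Cf", "Cn", "Co", "Cs"}
# Ordinary line breaks and tabs are expected in any text file.
ALLOWED = {"\n", "\r", "\t"}


def suspect_code_points(text):
    """Count the characters in `text` that fall in a suspect category."""
    counts = Counter()
    for ch in text:
        if ch not in ALLOWED and unicodedata.category(ch) in SUSPECT:
            counts["U+{:04X}".format(ord(ch))] += 1
    return dict(counts)
```
</pre>
                    <div class="gmail_default"
                      style="font-family:garamond,serif;font-size:large">An
                      empty result does not prove the text survived the
                      clipboard intact, but a non-empty one shows exactly
                      which code points need a mapping before the Unicode
                      conversion.</div>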
                    <br>
                    <div class="gmail_quote">
                      <div dir="ltr" class="gmail_attr">On Mon, May 13,
                        2019 at 10:11 AM Cyrille &lt;<a
                          href="mailto:lafricain79@gmail.com"
                          target="_blank" moz-do-not-send="true">lafricain79@gmail.com</a>&gt;
                        wrote:<br>
                      </div>
                      <blockquote class="gmail_quote"
                        style="margin:0px&#xA; 0px 0px
                        0.8ex;border-left:1px solid&#xA;
                        rgb(204,204,204);padding-left:1ex">
                        <div bgcolor="#FFFFFF"> David,<br>
                          You are probably right about <a
href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&amp;cat_id=TECkit"
                            target="_blank" moz-do-not-send="true">TECkit</a>:
                          once we have the text, it will help us convert
                          it to Unicode.<br>
                          As for how to get the text, your method is
                          beyond my skills :)<br>
                          If you succeed, please let me know.<br>
                          <br>
                          <div
class="gmail-m_2217136186459166179gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-cite-prefix">On
                            13/05/2019 16:21, David Haslam wrote:<br>
                          </div>
                          <blockquote type="cite">
                            <div>Given the insights from Michael Hart,
                              it may be feasible to temporarily
                              rearrange the main text stream as follows:</div>
                            <div><br>
                            </div>
                            <div>1. Replace every EOL by a horizontal
                              tab. </div>
                            <div>2. Insert an EOL after each verse end
                              character. </div>
                            <div><br>
                            </div>
                            <div>Observe that the above two steps are
                              wholly reversible such that the original
                              text stream can be restored later. </div>
                            <div><br>
                            </div>
                            <div>In effect the text stream is now in
                              verse per line (VPL) layout, albeit
                              without verse tags. Some adjustments may
                              be necessary if there are any section
                              headings, etc. </div>
                            <div><br>
                            </div>
                            <div>3. Add line numbers with the first
                              number being reset to 1 at the start of
                              each chapter, numbers incrementing by 1
                              for each line. </div>
                            <div>4. Add a left margin USFM verse tag \v_<br>
                            </div>
                            <div><br>
                            </div>
                            <div
id="gmail-m_2217136186459166179gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_mobile_signature_block">
                              <div>Steps 3&amp;4 can be implemented in
                                various ways. For my part, I’d use a
                                bespoke TextPipe filter. </div>
                              <div><br>
                              </div>
                              <div>Another method to consider might be
                                to use Excel formulae. I recall
                                resorting to such a method in the early
                                days of Go Bible. </div>
                              <div><br>
                              </div>
                              <div>Now restore the original layout by
                                reverting steps 2 &amp; 1, if this is
                                really necessary. That is, if the
                                original text layout appeared to be
                                paragraphed. </div>
                              <div><br>
                              </div>
                              <div>5. Decide how &amp; where to insert
                                paragraph tags. </div>
                              <div><br>
                              </div>
                              <div>6. Add chapter tags, book ID and main
                                title tags, etc. </div>
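                              <div>Steps 1 to 4 above could be sketched in
                                Python roughly as follows. This is only an
                                illustration, not the TextPipe filter or
                                the Excel formulae mentioned above. The
                                function names are invented, the verse end
                                marker must be supplied by the caller, and
                                the sketch assumes the text contains no
                                tabs to begin with and is handled one
                                chapter at a time.</div>
<pre>
```python
def to_vpl(text, verse_end):
    """Steps 1 and 2: one verse per line (VPL)."""
    # Step 1: replace every EOL by a horizontal tab.
    flat = text.replace("\r\n", "\n").replace("\n", "\t")
    # Step 2: insert an EOL after each verse end marker.
    return flat.replace(verse_end, verse_end + "\n")


def restore(vpl, verse_end):
    """Revert step 2, then step 1, to get the original layout back."""
    return vpl.replace(verse_end + "\n", verse_end).replace("\t", "\n")


def tag_verses(vpl_chapter):
    """Steps 3 and 4 for a single chapter: number the lines from 1 and
    prefix each one with a USFM verse tag."""
    lines = [ln for ln in vpl_chapter.split("\n") if ln.strip()]
    return "\n".join(
        "\\v {} {}".format(number, ln.strip(" "))
        for number, ln in enumerate(lines, 1)
    )
```
</pre>
                              <div>Calling <code>tag_verses</code> once per
                                chapter gives the reset to 1 described in
                                step 3. Where the chapters begin still has
                                to be decided by hand or by other
                                means.</div>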
                              <div><br>
                              </div>
                              <div>Hope this gives some useful
                                suggestions that point towards a
                                practical solution. </div>
                              <div><br>
                              </div>
                              <div>Best regards </div>
                              <div><br>
                              </div>
                              <div>David</div>
                              <div><br>
                              </div>
                              <div><br>
                              </div>
                              <div>Sent from ProtonMail Mobile</div>
                            </div>
                            <div><br>
                            </div>
                            <div><br>
                            </div>
                            On Mon, May 13, 2019 at 14:57, Michael H
                            &lt;<a href="mailto:cmahte@gmail.com"
                              target="_blank" moz-do-not-send="true">cmahte@gmail.com</a>&gt;
                            wrote:
                            <blockquote
class="gmail-m_2217136186459166179gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_quote"
                              type="cite">
                              <div dir="ltr">
                                <div dir="ltr">
                                  <div dir="ltr">
                                    <div dir="ltr">
                                      <div class="gmail_default"
                                        style="font-family:garamond,serif;font-size:large">Cyrille<br>
                                        <br>
                                        LibreOffice Draw attempts to
                                        open the Pagemaker file, with
                                        limited success. But it confirms
                                        that even in the Pagemaker
                                        source, the verse numbers are a
                                        separate text stream. With this
                                        source, there is no way to copy
                                        the text with verse numbers
                                        intact. Each book appears to be
                                        stored in its own text stream in
                                        the Pagemaker file. LO Draw isn't
                                        rendering all of the pages, only
                                        the first 10, so I've only
                                        explored Matthew further. <br>
                                        <br>
                                        Based on Matthew only, the
                                        verses seem to all end with the
                                        character "-" or ";/", which
                                        should aid in the
                                        reconstruction. I've looked
                                        through the PDF and this seems
                                        to be the case for all books
                                        visually as well. However, this
                                        isn't perfect: I find 1107 of
                                        these characters in Matthew,
                                        instead of the expected 1071
                                        verses.  But since the text
                                        stream has a book introduction,
                                        this is likely easily explained.
                                        Hopefully this gets you well
                                        down the path to creating a
                                        stream with verses. <br>
                                        <br>
                                        I would NOT start from the PDF
                                        file, but from the pagemaker
                                        file.  The PDF almost certainly
                                        has a lot of text rearranging
                                        and extra characters like page
                                        numbers and running heads. 
                                        Pagemaker has the book text in a
                                        single stream, in a form that
                                        will convert to unicode
                                        relatively easily. </div>
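                                      <div class="gmail_default"
                                        style="font-family:garamond,serif;font-size:large">The
                                        marker count mentioned above (1107
                                        found against 1071 expected, a
                                        surplus of 36) can be repeated for
                                        any book with a few lines of
                                        Python. A sketch only: it assumes
                                        the markers are exactly "-" and
                                        ";/" as observed in Matthew, and
                                        the function names are
                                        hypothetical.</div>
<pre>
```python
import re

# Verse end markers as observed in Matthew (an assumption for other books).
VERSE_END = re.compile(r";/|-")


def count_verse_ends(text):
    """Count every occurrence of a verse end marker in `text`."""
    return len(VERSE_END.findall(text))


def surplus(text, expected_verses):
    """Markers found minus verses expected; a positive value points to
    extra markers, for example in a book introduction."""
    return count_verse_ends(text) - expected_verses
```
</pre>
                                      <div class="gmail_default"
                                        style="font-family:garamond,serif;font-size:large">Running
                                        the count on the introduction
                                        alone would show whether it
                                        accounts for the whole
                                        surplus.</div>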
                                      <div class="gmail_default"
                                        style="font-family:garamond,serif;font-size:large"><br>
                                      </div>
                                    </div>
                                  </div>
                                </div>
                              </div>
                            </blockquote>
                            <div><br>
                            </div>
                            <div><br>
                            </div>
                            <br>
                            <fieldset
class="gmail-m_2217136186459166179gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636mimeAttachmentHeader"></fieldset>
                            <pre class="gmail-m_2217136186459166179gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-quote-pre">_______________________________________________
sword-devel mailing list: <a class="gmail-m_2217136186459166179gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org" target="_blank" moz-do-not-send="true">sword-devel@crosswire.org</a>
<a class="gmail-m_2217136186459166179gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel" target="_blank" moz-do-not-send="true">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
                          </blockquote>
                          <br>
                        </div>
                        </blockquote>
                    </div>
                  </blockquote>
                </div>
                <br>
              </blockquote>
              <br>
            </div>
            </blockquote>
        </div>
        <br>
      </blockquote>
      <br>
    </blockquote>
    <br>
  </body>
</html>