<div>Hi&nbsp;Cyrille&nbsp;</div><div><br></div><div>If I can find the time tomorrow or later, I’ll have a look at what might be feasible.&nbsp;</div><div><br></div><div>Thanks for all these useful links.&nbsp;</div><div><br></div><div>David</div><div><br></div><div id="protonmail_mobile_signature_block"><div>Sent from ProtonMail Mobile</div></div> <div><br></div><div><br></div>On Tue, May 14, 2019 at 14:08, Cyrille &lt;<a href="mailto:lafricain79@gmail.com" class="">lafricain79@gmail.com</a>&gt; wrote:<blockquote class="protonmail_quote" type="cite">




    I am sending my message again because the first copy was too big.<br>
    <br>
    The conversion to UTF-8 is 99% solved!! I used an online converter:<br>
    <a class="moz-txt-link-freetext" href="https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html">https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html</a><br>
    or:<br>
    <a class="moz-txt-link-freetext" href="http://burglish.my-mm.org/latest/trunk/web/fontconv.htm">http://burglish.my-mm.org/latest/trunk/web/fontconv.htm</a><br>
    <br>
    See the result <a href="https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=">here</a>.<br>
    <br>
    Now the only problem is how to get the verse and chapter number... <br>
    <br>
    <br>
    <div class="moz-cite-prefix">On 14/05/2019 13:53, Michael H
      wrote:<br>
    </div>
    <blockquote type="cite">

      <div dir="ltr">
        <div dir="ltr">
          <div dir="ltr">
            <div class="gmail_default"><font size="4" face="garamond,
                serif">Cyrille, (Peter),&nbsp;<br>
                <br>
                Maybe further discussion on this belongs in Gitlab as
                issues.&nbsp; Can I get added to this project?&nbsp;<br>
                <br>
                Here are the first few lines of Matthew copied from the
                PDF:&nbsp;</font><br>
              ------<br>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">&amp;Sifrmaw;OD;
                {0Ha*vdusrf;</div>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">The
                Gospel According to Matthew</div>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">ed'gef;</div>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">usr;f
                ûyy*k Kd¾v f &amp;iS rf maw;O;D \b0rwS wf r;f</div>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">usr;f
                ûyy*k Kd¾v f &amp;iS rf maw;O;Don f *gavav;,e,rf
                S*sL;vrl sK;d tmvaf z;O;D \om;jzp\f / (rmu k2;14)</div>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">olonf
                tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
                a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm</div>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">av0djzp\f
                / ool n f wad b;&amp;,d tidk tf e;DwGi f a,Z;lociEf iS
                ahf wG U Ny;D<br>
                <br>
              </div>
              <div class="gmail_default" style="font-family:garamond,serif;font-size:large">-----</div>
              <div class="gmail_default"><font size="4" face="garamond,
                  serif">And here are the first few lines of Matthew
                  copied from the Pagemaker file:&nbsp;</font></div>
              <div class="gmail_default"><font size="4" face="garamond,
                  serif">-----<br>
                </font>
                <div class="gmail_default"><font size="4" face="garamond, serif">Sifrmaw;OD; {0Ha*vdusrf;</font></div>
                <div class="gmail_default"><font size="4" face="garamond, serif">The Gospel According to
                    Matthew</font></div>
                <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">ed'gef;</span><br>
                </div>
                <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf&nbsp;
                    &amp;Sifrmaw;OD;\b0rSwfwrf;&nbsp;&nbsp;</span><br>
                </div>
                <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf&nbsp;
                    &amp;Sifrmaw;OD;onf&nbsp; *gavav;,e,frS *sL;vlrsKd;
                    tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf&nbsp;
                    tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
                    a,Zl;ocif\aemufvdkufwynfhrjzpfrD&nbsp; ol\trnfrSm
                    av0djzpf\/ olonf&nbsp; wdab;&amp;d,tkdifteD;wGif&nbsp;
                    a,Zl;ocifESifhawGU&nbsp; NyD;<br>
                    <br>
                    <br>
                    You can see that some letters have changed, and some
                    others are in a different order.&nbsp;<br>
                    <br>
                  </span><span style="font-family:garamond,serif;font-size:large">The
                    letters that change are likely those points that
                    aren't compatible with unicode, and pagemaker
                    reassigned them to ensure that the file is more
                    widely viewable. Since a conversion is already
                    planned, these won't matter as much, but the font
                    embedded in the PDF is different from the font
                    attached to the PageMaker file.&nbsp; If you do start
                    from the PDF, you'll need to extract the font to get
                    the code points.&nbsp;</span><br style="font-family:garamond,serif;font-size:large">
                  <span style="font-family:garamond,serif;font-size:large"><br>
                    The problem is that the PDF export from pagemaker
                    sorts the letters into the order they appear on the
                    page.&nbsp; Burmese text has Indic-style ligatures,
                    where vowels tend to jump over or under the previous
                    letters, sometimes back two or three letters. If you
                    study the snippets above from the beginning of
                    Matthew, you can see that the order differs and
                    that some glyphs are modified.&nbsp;<br>
                    <br>
                    So, from the PDF the letters are out of order, but from
                    PageMaker some letters are encoded at control code points.
                    Fixing the control code points is easy and happens with
                    the Unicode conversion.&nbsp; Fixing the letter order is
                    not easy. You'll need a first language speaker and
                    plenty of time.&nbsp;</span></div>
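                <div class="gmail_default">One rough way to tell the two kinds of difference apart is to compare the characters of a PDF run and the matching PageMaker run while ignoring order. The Python sketch below is only an illustration: the function name <code>compare_runs</code> is made up for this example, and it assumes that the spaces scattered through the PDF text can simply be ignored.</div>
<pre>
```python
from collections import Counter


def compare_runs(pdf_text, pagemaker_text):
    """Classify how a PDF-extracted run differs from the PageMaker run."""
    # Assumption: spaces in the PDF text are extraction noise, so drop them.
    pdf = pdf_text.replace(" ", "")
    pm = pagemaker_text.replace(" ", "")
    if pdf == pm:
        return "identical"
    # Same characters in a different order means only the order changed.
    if Counter(pdf) == Counter(pm):
        return "reordered"
    return "substituted"


# Two runs taken from the snippets quoted in this message.
print(compare_runs("usr;f", "usrf;"))
print(compare_runs("tmvaf z;O;D", "tmvfaz;OD;"))
```
</pre>
                <div class="gmail_default">Both sample pairs come out as "reordered", which matches the observation that the PDF export changes letter order. A result of "substituted" would point at a glyph that was reassigned to another code point.</div>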
                <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large"><br>
                    The guidance I received on another group was to use
                    either LO Draw or Indesign to export the text from
                    Pagemaker.&nbsp; I'll look into LO Draw again, but I
                    don't have access to an older version of Indesign
                    (the pagemaker import was removed in CS6).&nbsp;</span><span style="font-family:garamond,serif;font-size:large"><br>
                  </span></div>
              </div>
            </div>
          </div>
        </div>
      </div>
      <div dir="ltr">
        <div class="gmail_default" style="font-family:garamond,serif;font-size:large"><br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, May 13, 2019 at 10:40
          AM Michael H &lt;<a href="mailto:cmahte@gmail.com">cmahte@gmail.com</a>&gt;
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="ltr">
            <div class="gmail_default" style="font-family:garamond,serif;font-size:large">I
              unzipped the pagemaker file, and when I open
              NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can
              'find' all of the book names, and see the text there.&nbsp;&nbsp;<br>
              <br>
              To see the raw text: rename NT_Proverb.pmd &gt;
              NT_Proverb.zip and open it with a zip archive program.&nbsp;
              The text is in the Pagemaker file at the top level of the
              archive, but encoded with a lot of extraneous
              information.&nbsp; (The English text "Matthew" appears at hex
              location 7A76972).&nbsp;<br>
              <br>
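              <div>For anyone without a hex editor, the same kind of search can be sketched in Python. This is a minimal illustration, not a PageMaker parser: <code>find_offsets</code> is a hypothetical helper, and the sample bytes are a stand-in, not data from the real file.</div>
<pre>
```python
def find_offsets(data, needle):
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


# Stand-in bytes. For the real file one would read it with
# open(path, "rb").read(), where path is wherever the file was unpacked.
sample = b"\x00\x01Matthew\x00\xffMark\x00Matthew"
for offset in find_offsets(sample, b"Matthew"):
    print(format(offset, "X"))  # offsets in hex, as a hex editor shows them
```
</pre>
              <div>Searching for the English book names this way only locates the ASCII titles. The Burmese text around them still needs the font-specific conversion.</div>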
              When I open the fonts with fontforge, Fontforge suggests
              the fonts are encoded as unicode (but the glyphs are
              obviously not in the right spot.)&nbsp;<br>
              However, when I copy the text (I copied from LO Draw),
              paste it into jEdit and save that as Unicode, reopening
              the file gives the warning 'not unicode, text may be
              missing'.&nbsp;<br>
              <br>
              So, what this means is that there are some glyphs encoded
              into locations that unicode treats as control or
              non-printing codes. The text needs to be dealt with as a
              specific encoding that matches whatever the original font
              actually uses. I haven't figured out what the original
              text files were encoded with. Without that knowledge, I'm
              not sure my system clipboard or editor (jedit) will
              properly respect the glyphs in unusual locations until the
              conversion to unicode, and I don't trust myself to be able
              to detect if it is or is not properly converted.&nbsp;<br>
            </div>
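            <div class="gmail_default" style="font-family:garamond,serif;font-size:large">A possible way to check a copied text for glyphs sitting at control or non-printing locations is to look at the Unicode general category of each character. The sketch below uses Python's standard <code>unicodedata</code> module. The helper name and the sample string are invented for illustration, and the choice of categories to flag (Cc, Cn, Co) is an assumption about what counts as suspicious here.</div>
<pre>
```python
import unicodedata


def suspicious_code_points(text):
    """List (index, code point) pairs for control, unassigned or private-use characters."""
    keep = {"\n", "\r", "\t"}  # ordinary whitespace controls are expected
    hits = []
    for index, char in enumerate(text):
        if char in keep:
            continue
        # Cc = control, Cn = unassigned, Co = private use
        if unicodedata.category(char) in {"Cc", "Cn", "Co"}:
            hits.append((index, "U+{:04X}".format(ord(char))))
    return hits


# Invented sample: two C1 control characters hidden in ordinary letters.
print(suspicious_code_points("abc\x8fdef\x9dg\n"))
```
</pre>
            <div class="gmail_default" style="font-family:garamond,serif;font-size:large">An empty list does not prove the text survived the clipboard intact. It only shows that nothing landed on a code point of those categories.</div>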
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Mon, May 13, 2019 at
              10:11 AM Cyrille &lt;<a href="mailto:lafricain79@gmail.com">lafricain79@gmail.com</a>&gt;
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF"> David,<br>
                You are probably right about <a href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&amp;cat_id=TECkit">TECkit</a>: if
                we get the text, it will help us convert it to Unicode.<br>
                As for how to get the text, your method is beyond my
                skills :)<br>
                If you succeed, please let me know.<br>
                <br>
                <div class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-cite-prefix">On
                  13/05/2019 16:21, David Haslam wrote:<br>
                </div>
                <blockquote type="cite">
                  <div>Given the insights from Michael Hart, it may be
                    feasible to temporarily rearrange the main text
                    stream as follows :</div>
                  <div><br>
                  </div>
                  <div>1. Replace every EOL by a horizontal tab.&nbsp;</div>
                  <div>2. Insert an EOL after each verse end character.&nbsp;</div>
                  <div><br>
                  </div>
                  <div>Observe that the above two steps are
                    wholly&nbsp;reversible such that the original text stream
                    can be restored later.&nbsp;</div>
                  <div><br>
                  </div>
                  <div>In effect the text stream is now in verse per
                    line (VPL) layout, albeit without verse tags. Some
                    adjustments may be necessary if there are any section
                    headings, etc.&nbsp;</div>
                  <div><br>
                  </div>
                  <div>3. Add line numbers with the first number being
                    reset to 1 at the start of each chapter, numbers
                    incrementing by 1 for each line.&nbsp;</div>
                  <div>4. Add a left margin USFM verse tag \v_<br>
                  </div>
                  <div><br>
                  </div>
                  <div id="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_mobile_signature_block">
                    <div>Steps 3&amp;4 can be implemented in various
                      ways. For my part, I’d use a bespoke TextPipe
                      filter.&nbsp;</div>
                    <div><br>
                    </div>
                    <div>Another method to consider might be to use
                      Excel formulae. I recall resorting to such a
                      method in the early days of Go Bible.&nbsp;</div>
                    <div><br>
                    </div>
                    <div>Now restore the original layout by reverting
                      steps 2 &amp; 1, if this is really necessary. That
                      is, if the original text layout appeared to be
                      paragraphed.&nbsp;</div>
                    <div><br>
                    </div>
                    <div>5. Decide how &amp; where to insert paragraph
                      tags.&nbsp;</div>
                    <div><br>
                    </div>
                    <div>6. Add chapter tags, book ID and main title
                      tags, etc.&nbsp;</div>
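                    <div>Steps 1 to 4 above, and the reversal of steps 2 and 1, could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are made up, the verse end marker is passed in by the caller (the sample uses "/" purely as a stand-in), the text is assumed to contain no tabs beforehand, and the caller is assumed to supply one chapter at a time because chapter boundaries are not marked in the stream.</div>
<pre>
```python
def to_verse_per_line(text, verse_end):
    """Steps 1 and 2: EOL to tab, then an EOL after each verse end."""
    if "\t" in text:
        raise ValueError("text already contains tabs; step 1 would not be reversible")
    return text.replace("\n", "\t").replace(verse_end, verse_end + "\n")


def restore_layout(vpl_text, verse_end):
    """Revert step 2, then step 1."""
    return vpl_text.replace(verse_end + "\n", verse_end).replace("\t", "\n")


def tag_chapter(vpl_chapter):
    """Steps 3 and 4: number the lines of one chapter and prefix a USFM verse tag."""
    lines = [line for line in vpl_chapter.split("\n") if line.strip()]
    return "\n".join(
        "\\v {} {}".format(number, line)
        for number, line in enumerate(lines, start=1)
    )


# Stand-in text with "/" as a hypothetical verse end marker.
text = "one/ two\nthree/ four/\n"
vpl = to_verse_per_line(text, "/")
print(tag_chapter(vpl))
print(restore_layout(vpl, "/") == text)
```
</pre>
                    <div>The tabs kept inside the tagged lines are what make it possible to restore the original line breaks later. Section headings and introductions would still need manual adjustment before numbering.</div>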
                    <div><br>
                    </div>
                    <div>Hope this gives some useful suggestions that
                      point towards a practical solution.&nbsp;</div>
                    <div><br>
                    </div>
                    <div>Best regards&nbsp;</div>
                    <div><br>
                    </div>
                    <div>David</div>
                    <div><br>
                    </div>
                    <div><br>
                    </div>
                    <div>Sent from ProtonMail Mobile</div>
                  </div>
                  <div><br>
                  </div>
                  <div><br>
                  </div>
                  On Mon, May 13, 2019 at 14:57, Michael H &lt;<a href="mailto:cmahte@gmail.com">cmahte@gmail.com</a>&gt;
                  wrote:
                  <blockquote class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_quote" type="cite">
                    <div dir="ltr">
                      <div dir="ltr">
                        <div dir="ltr">
                          <div dir="ltr">
                            <div class="gmail_default" style="font-family:garamond,serif;font-size:large">Cyrille<br>
                              <br>
                              LibreOffice Draw attempts to open the
                              pagemaker file, with limited success. But
                              it confirms that even in the pagemaker
                              source, the verse numbers are a separate
                              text stream. With this source, there is no
                              way to copy the text with verse numbers
                              intact. Each book appears to be stored in
                              its own text stream in the PageMaker
                              file. LO Draw isn't rendering all of the
                              pages, only the first 10, so I've only
                              explored Matthew further.&nbsp;<br>
                              <br>
                              Based on Matthew only, the verses seem to
                              all end with the character "-" or ";/",
                              which should aid in the reconstruction.
                              I've looked through the PDF and this seems
                              to be the case for all books visually as
                              well. However, this isn't perfect: I find
                              1107 of these characters in Matthew,
                              instead of the expected 1071 verses.&nbsp; But
                              since the text stream has a book
                              introduction, this is likely easily
                              explained. Hopefully this gets you well
                              down the path to creating a stream with
                              verses.&nbsp;<br>
                              <br>
                              I would NOT start from the PDF file, but
                              from the pagemaker file.&nbsp; The PDF almost
                              certainly has a lot of text rearranging
                              and extra characters like page numbers and
                              running heads.&nbsp; Pagemaker has the book
                              text in a single stream, in a form that
                              will convert to unicode relatively
                              easily.&nbsp;</div>
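                            <div class="gmail_default" style="font-family:garamond,serif;font-size:large">The count of verse-ending characters mentioned above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the sample stream is invented, and the terminators "-" and ";/" are taken from the observation on Matthew, so they may not hold for every book.</div>
<pre>
```python
def count_verse_ends(stream, terminators):
    """Count non-overlapping occurrences of each candidate terminator."""
    return {mark: stream.count(mark) for mark in terminators}


def surplus(stream, terminators, expected_verses):
    """How many more terminators were found than verses expected."""
    return sum(count_verse_ends(stream, terminators).values()) - expected_verses


# Invented stream: three verses plus one terminator inside an introduction.
sample = "a;/ b;/ c- intro;/"
print(count_verse_ends(sample, ["-", ";/"]))
print(surplus(sample, ["-", ";/"], 3))
```
</pre>
                            <div class="gmail_default" style="font-family:garamond,serif;font-size:large">With the figures quoted for Matthew, 1107 terminators against 1071 verses, the surplus would be 36. Those are the places to inspect by hand, starting with the book introduction.</div>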
                            <div class="gmail_default" style="font-family:garamond,serif;font-size:large"><br>
                            </div>
                          </div>
                        </div>
                      </div>
                    </div>
                  </blockquote>
                  <div><br>
                  </div>
                  <div><br>
                  </div>
                  <br>
                  <fieldset class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636mimeAttachmentHeader"></fieldset>
                  <pre class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-quote-pre">_______________________________________________
sword-devel mailing list: <a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a>
<a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
                </blockquote>
                <br>
              </div>
              </blockquote>
          </div>
        </blockquote>
      </div>
      <br>
    </blockquote>
    <br>


</blockquote><div><br></div><div><br></div>