<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <br>
    <br>
    <div class="moz-cite-prefix">On 14/05/2019 22:48, Michael H
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAJ9hia-speB1UPxm+CofuJg6L7VoT6mfx8bsQsNkYshEO-_Prw@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div class="gmail_default"
          style="font-family:garamond,serif;font-size:large">You should
          be able to configure a regex search to find the verse
          boundaries. <br>
          <br>
          Once you have verse boundaries, if you arrange the text as
          verse per line, it should be possible to assign each row a
          chapter and verse number from a reference. That is, the 3341st
          verse in the New Testament is usually John 20:31 (I don't have
          that memorized, just an example.) <br>
        </div>
      </div>
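      <div>A minimal sketch of that second step, assuming the text is
        already one verse per row and that a list of verse counts per
        chapter is available for the book. The names
        <code>reference_for</code> and <code>label_rows</code>, and the
        counts in the usage note, are illustrative assumptions and not
        part of any existing tool:</div>
      <pre>
```python
def reference_for(index, verses_per_chapter):
    """Map a 1-based running verse number within one book to (chapter, verse)."""
    if 1 > index:
        raise ValueError("verse index must be 1 or greater")
    remaining = index
    for chapter, count in enumerate(verses_per_chapter, start=1):
        # The verse lies in this chapter if the chapter has enough verses left.
        if count >= remaining:
            return chapter, remaining
        remaining -= count
    raise ValueError("verse index beyond end of book")


def label_rows(rows, book, verses_per_chapter):
    """Prefix each verse-per-line row with its 'Book chapter:verse' reference."""
    labelled = []
    for index, text in enumerate(rows, start=1):
        chapter, verse = reference_for(index, verses_per_chapter)
        labelled.append(f"{book} {chapter}:{verse} {text}")
    return labelled
```
</pre>
      <div>With a made-up book of two chapters holding 2 and 3 verses,
        <code>label_rows(["a", "b", "c"], "Matt", [2, 3])</code> gives
        <code>["Matt 1:1 a", "Matt 1:2 b", "Matt 2:1 c"]</code>. The real
        counts would have to come from whichever versification the
        module follows.</div>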
    </blockquote>
    <br>
    I have no idea how to do this :)<br>
    <blockquote type="cite"
cite="mid:CAJ9hia-speB1UPxm+CofuJg6L7VoT6mfx8bsQsNkYshEO-_Prw@mail.gmail.com"><br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Tue, May 14, 2019 at 3:22
          PM Cyrille &lt;<a href="mailto:lafricain79@gmail.com"
            moz-do-not-send="true">lafricain79@gmail.com</a>&gt; wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div bgcolor="#FFFFFF"> OK, thank you!  I already have all the
            text in Unicode, but without the verse and chapter
            numbers... I have begun adding them manually...<br>
            <br>
            <div class="gmail-m_-4094282784364978796moz-cite-prefix">On
              14/05/2019 22:17, David Haslam wrote:<br>
            </div>
            <blockquote type="cite">
              <div>Hi Cyrille </div>
              <div><br>
              </div>
              <div>If I can find the time tomorrow or later, I’ll have a
                look at what might be feasible. </div>
              <div><br>
              </div>
              <div>Thanks for all these useful links. </div>
              <div><br>
              </div>
              <div>David</div>
              <div><br>
              </div>
              <div
                id="gmail-m_-4094282784364978796protonmail_mobile_signature_block">
                <div>Sent from ProtonMail Mobile</div>
              </div>
              <div><br>
              </div>
              <div><br>
              </div>
              On Tue, May 14, 2019 at 14:08, Cyrille &lt;<a
                href="mailto:lafricain79@gmail.com" target="_blank"
                moz-do-not-send="true">lafricain79@gmail.com</a>&gt;
              wrote:
              <blockquote
                class="gmail-m_-4094282784364978796protonmail_quote"
                type="cite"> I am sending my message again because it was
                too big.<br>
                <br>
                The conversion to UTF-8 is 99% solved!! I used an online
                converter:<br>
                <a
                  class="gmail-m_-4094282784364978796moz-txt-link-freetext"
href="https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html"
                  target="_blank" moz-do-not-send="true">https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html</a><br>
                or:<br>
                <a
                  class="gmail-m_-4094282784364978796moz-txt-link-freetext"
href="http://burglish.my-mm.org/latest/trunk/web/fontconv.htm"
                  target="_blank" moz-do-not-send="true">http://burglish.my-mm.org/latest/trunk/web/fontconv.htm</a><br>
                <br>
                See the result <a
href="https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA="
                  target="_blank" moz-do-not-send="true">here</a>.<br>
                <br>
                Now the only problem is how to get the verse and chapter
                numbers... <br>
                <br>
                <br>
                <div class="gmail-m_-4094282784364978796moz-cite-prefix">On
                  14/05/2019 13:53, Michael H wrote:<br>
                </div>
                <blockquote type="cite">
                  <div dir="ltr">
                    <div dir="ltr">
                      <div dir="ltr">
                        <div class="gmail_default"><font size="4"
                            face="garamond,&#xA; serif">Cyrille,
                            (Peter), <br>
                            <br>
                            Maybe further discussion on this belongs in
                            Gitlab as issues.  Can I get added to this
                            project? <br>
                            <br>
                            Here are the first few lines of Matthew
                            copied from the PDF: </font><br>
                          ------<br>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">&amp;Sifrmaw;OD;
                            {0Ha*vdusrf;</div>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">The
                            Gospel According to Matthew</div>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">ed'gef;</div>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">usr;f
                            ûyy*k Kd¾v f &amp;iS rf maw;O;D \b0rwS wf
                            r;f</div>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">usr;f
                            ûyy*k Kd¾v f &amp;iS rf maw;O;Don f
                            *gavav;,e,rf S*sL;vrl sK;d tmvaf z;O;D
                            \om;jzp\f / (rmu k2;14)</div>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">olonf
                            tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
                            a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm</div>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">av0djzp\f
                            / ool n f wad b;&amp;,d tidk tf e;DwGi f
                            a,Z;lociEf iS ahf wG U Ny;D<br>
                            <br>
                          </div>
                          <div class="gmail_default"
                            style="font-family:garamond,serif;font-size:large">-----</div>
                          <div class="gmail_default"><font size="4"
                              face="garamond,&#xA; serif">And here are
                              the first few lines of Matthew copied from
                              the Pagemaker file: </font></div>
                          <div class="gmail_default"><font size="4"
                              face="garamond,&#xA; serif">-----<br>
                            </font>
                            <div class="gmail_default"><font size="4"
                                face="garamond, serif">Sifrmaw;OD;
                                {0Ha*vdusrf;</font></div>
                            <div class="gmail_default"><font size="4"
                                face="garamond, serif">The Gospel
                                According to Matthew</font></div>
                            <div class="gmail_default"><span
                                style="font-family:garamond,serif;font-size:large">ed'gef;</span><br>
                            </div>
                            <div class="gmail_default"><span
                                style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf 
                                &amp;Sifrmaw;OD;\b0rSwfwrf;  </span><br>
                            </div>
                            <div class="gmail_default"><span
                                style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf 
                                &amp;Sifrmaw;OD;onf  *gavav;,e,frS
                                *sL;vlrsKd; tmvfaz;OD;\om;jzpf\/ (rmuk
                                2;14) olonf  tcGefcHoltjzpf
                                trIxrf;chJonf/ (vk 5;27)
                                a,Zl;ocif\aemufvdkufwynfhrjzpfrD 
                                ol\trnfrSm av0djzpf\/ olonf 
                                wdab;&amp;d,tkdifteD;wGif 
                                a,Zl;ocifESifhawGU  NyD;<br>
                                <br>
                                <br>
                                You can see that some letters have
                                changed, and some others are in a
                                different order. <br>
                                <br>
                              </span><span
                                style="font-family:garamond,serif;font-size:large">The
                                letters that change are likely those
                                code points that aren't compatible with
                                Unicode, which PageMaker reassigned
                                to ensure that the file is more widely
                                viewable. Since a conversion is already
                                planned, these won't matter as much, but
                                the font embedded in the PDF is
                                different from the font attached to the
                                PageMaker file. If you do start from
                                the PDF, you'll need to extract the font
                                to get the code points. </span><br
                                style="font-family:garamond,serif;font-size:large">
                              <span
                                style="font-family:garamond,serif;font-size:large"><br>
                                The problem is that the PDF export from
                                PageMaker sorts the letters into the
                                order they appear on the page.  Burmese
                                text has Indic-style ligatures, where
                                vowels tend to jump over or under the
                                previous letters, sometimes back two or
                                three letters. If you study the
                                snippets above from the beginning of
                                Matthew, you can see that the order
                                differs and that some glyphs are
                                modified. <br>
                                <br>
                                So, from the PDF the letters are out of
                                order, but from PageMaker the letters are
                                encoded into control code points. Fixing
                                the control code points is easy and
                                happens with the Unicode conversion.
                                Fixing the letter order is not easy.
                                You'll need a first-language speaker and
                                plenty of time. </span></div>
                            <div class="gmail_default"><span
                                style="font-family:garamond,serif;font-size:large"><br>
                                The guidance I received on another group
                                was to use either LO Draw or InDesign to
                                export the text from PageMaker.  I'll
                                look into LO Draw again, but I don't
                                have access to an older version of
                                InDesign (the PageMaker import was
                                removed in CS6). </span><span
                                style="font-family:garamond,serif;font-size:large"><br>
                              </span></div>
                          </div>
                        </div>
                      </div>
                    </div>
                  </div>
                  <div dir="ltr">
                    <div class="gmail_default"
                      style="font-family:garamond,serif;font-size:large"><br>
                    </div>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr" class="gmail_attr">On Mon, May 13,
                      2019 at 10:40 AM Michael H &lt;<a
                        href="mailto:cmahte@gmail.com" target="_blank"
                        moz-do-not-send="true">cmahte@gmail.com</a>&gt;
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <div dir="ltr">
                        <div class="gmail_default"
                          style="font-family:garamond,serif;font-size:large">I
                          unzipped the PageMaker file, and when I open
                          NT_Proverb/Pagemaker (10.1 MB) with a hex
                          editor, I can 'find' all of the book names
                          and see the text there.  <br>
                          <br>
                          To see the raw text: rename NT_Proverb.pmd
                          &gt; NT_Proverb.zip and open it with a zip
                          archive program.  The text is in the
                          Pagemaker file at the top level of the
                          archive, but encoded with a lot of extraneous
                          information.  (The English text "Matthew"
                          appears at hex location 7A76972). <br>
                          <br>
                          When I open the fonts with FontForge,
                          it suggests the fonts are encoded as
                          Unicode (but the glyphs are obviously not in
                          the right spot.) <br>
                          However, when I copy the text (from LO
                          Draw), paste it into jEdit and save it as
                          Unicode, reopening the file gives the warning
                          'not unicode, text may be missing'. <br>
                          <br>
                          So, what this means is that there are some
                          glyphs encoded into locations that unicode
                          treats as control or non-printing codes. The
                          text needs to be dealt with as a specific
                          encoding that matches whatever the original
                          font actually uses. I haven't figured out what
                          the original text files were encoded with.
                          Without that knowledge, I'm not sure my system
                          clipboard or editor (jedit) will properly
                          respect the glyphs in unusual locations until
                          the conversion to unicode, and I don't trust
                          myself to be able to detect if it is or is not
                          properly converted. <br>
                        </div>
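                      <div>A rough sketch of both checks, assuming the
                        file's bytes and the pasted text are already
                        loaded in Python. The helper names
                        <code>find_offsets</code> and
                        <code>suspicious_chars</code> are hypothetical;
                        the second one only flags characters that
                        Unicode classes as control, private-use or
                        unassigned, which is a hint that a legacy font
                        encoding is in play and not proof of it:</div>
                      <pre>
```python
import unicodedata


def find_offsets(data: bytes, needle: bytes):
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    position = data.find(needle)
    while position != -1:
        offsets.append(position)
        position = data.find(needle, position + 1)
    return offsets


def suspicious_chars(text):
    """List characters that Unicode treats as control (Cc), private use (Co)
    or unassigned (Cn), ignoring ordinary tab and line-end characters."""
    flagged = set()
    for character in text:
        if character in "\t\r\n":
            continue
        if unicodedata.category(character) in ("Cc", "Co", "Cn"):
            flagged.add(character)
    return sorted(flagged)
```
</pre>
                      <div>Something like
                        <code>[hex(o) for o in find_offsets(data, b"Matthew")]</code>
                        would list the locations in the same form a hex
                        editor shows them, where <code>data</code> is the
                        result of reading the file in binary mode.</div>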
                      </div>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Mon, May
                          13, 2019 at 10:11 AM Cyrille &lt;<a
                            href="mailto:lafricain79@gmail.com"
                            target="_blank" moz-do-not-send="true">lafricain79@gmail.com</a>&gt;
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote"
                          style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">
                          <div bgcolor="#FFFFFF"> David,<br>
                            You are probably right about <a
href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&amp;cat_id=TECkit"
                              target="_blank" moz-do-not-send="true">TECkit</a>:
                            once we get the text, it will help us
                            convert it to Unicode.<br>
                            As for how to get the text, your method is
                            beyond my skills :)<br>
                            If you succeed, please let me know.<br>
                            <br>
                            <div
class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-cite-prefix">On
                              13/05/2019 16:21, David Haslam wrote:<br>
                            </div>
                            <blockquote type="cite">
                              <div>Given the insights from Michael Hart,
                                it may be feasible to temporarily
                                rearrange the main text stream as
                                follows :</div>
                              <div><br>
                              </div>
                              <div>1. Replace every EOL by a horizontal
                                tab. </div>
                              <div>2. Insert an EOL after each verse end
                                character. </div>
                              <div><br>
                              </div>
                              <div>Observe that the above two steps are
                                wholly reversible such that the original
                                text stream can be restored later. </div>
                              <div><br>
                              </div>
                              <div>In effect the text stream is now in
                                verse per line (VPL) layout, albeit
                                without verse tags. Some adjustments may
                                be necessary if there are any section
                                headings, etc. </div>
                              <div><br>
                              </div>
                              <div>3. Add line numbers with the first
                                number being reset to 1 at the start of
                                each chapter, numbers incrementing by 1
                                for each line. </div>
                              <div>4. Add a left margin USFM verse tag
                                \v_<br>
                              </div>
                              <div><br>
                              </div>
                              <div
id="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_mobile_signature_block">
                                <div>Steps 3&amp;4 can be implemented in
                                  various ways. For my part, I’d use a
                                  bespoke TextPipe filter. </div>
                                <div><br>
                                </div>
                                <div>Another method to consider might be
                                  to use Excel formulae. I recall
                                  resorting to such a method in the
                                  early days of Go Bible. </div>
                                <div><br>
                                </div>
                                <div>Now restore the original layout by
                                  reverting steps 2 &amp; 1, if this is
                                  really necessary. That is, if the
                                  original text layout appeared to be
                                  paragraphed. </div>
                                <div><br>
                                </div>
                                <div>5. Decide how &amp; where to insert
                                  paragraph tags. </div>
                                <div><br>
                                </div>
                                <div>6. Add chapter tags, book ID and
                                  main title tags, etc. </div>
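                                <div>Steps 1 to 4 above could be sketched
                                  as follows. This assumes a
                                  single-character verse terminator (the
                                  "/" default is only a placeholder), no
                                  tab characters in the original stream,
                                  plain "\n" line ends, and chapter
                                  boundaries that are already known. All
                                  function names are hypothetical:</div>
                                <pre>
```python
VERSE_END = "/"  # placeholder terminator; replace with the real verse-end mark


def to_vpl(stream, verse_end=VERSE_END):
    """Steps 1 and 2: turn every EOL into a tab, then break after each verse end."""
    flat = stream.replace("\n", "\t")
    return flat.replace(verse_end, verse_end + "\n")


def from_vpl(vpl, verse_end=VERSE_END):
    """Revert steps 2 and 1, restoring the original line layout."""
    return vpl.replace(verse_end + "\n", verse_end).replace("\t", "\n")


def tag_chapter(lines):
    """Steps 3 and 4: number one chapter's lines from 1 and prefix the USFM \\v marker."""
    return [f"\\v {number} {line}" for number, line in enumerate(lines, start=1)]
```
</pre>
                                <div>The round trip
                                  <code>from_vpl(to_vpl(stream)) == stream</code>
                                  holds only under the assumptions listed
                                  above, which is why a stray tab in the
                                  source would need handling first.</div>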
                                <div><br>
                                </div>
                                <div>Hope this gives some useful
                                  suggestions that point towards a
                                  practical solution. </div>
                                <div><br>
                                </div>
                                <div>Best regards </div>
                                <div><br>
                                </div>
                                <div>David</div>
                                <div><br>
                                </div>
                                <div><br>
                                </div>
                                <div>Sent from ProtonMail Mobile</div>
                              </div>
                              <div><br>
                              </div>
                              <div><br>
                              </div>
                              On Mon, May 13, 2019 at 14:57, Michael H
                              &lt;<a href="mailto:cmahte@gmail.com"
                                target="_blank" moz-do-not-send="true">cmahte@gmail.com</a>&gt;
                              wrote:
                              <blockquote
class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_quote"
                                type="cite">
                                <div dir="ltr">
                                  <div dir="ltr">
                                    <div dir="ltr">
                                      <div dir="ltr">
                                        <div class="gmail_default"
                                          style="font-family:garamond,serif;font-size:large">Cyrille<br>
                                          <br>
                                          LibreOffice Draw attempts to
                                          open the PageMaker file, with
                                          limited success. But it
                                          confirms that even in the
                                          PageMaker source, the verse
                                          numbers are a separate text
                                          stream. With this source,
                                          there is no way to copy the
                                          text with verse numbers
                                          intact. Each book appears to
                                          be stored in its own text
                                          stream in the PageMaker file.
                                          LO Draw isn't rendering all of
                                          the pages, only the first 10,
                                          so I've only explored Matthew
                                          further. <br>
                                          <br>
                                          Based on Matthew only, the
                                          verses seem to all end with
                                          the character "-" or ";/",
                                          which should aid in the
                                          reconstruction. I've looked
                                          through the PDF and this seems
                                          to be the case for all books
                                          visually as well. However,
                                          this isn't perfect: I find
                                          1107 of these characters in
                                          Matthew, instead of the
                                          expected 1071 verses.  But
                                          since the text stream has a
                                          book introduction, this is
                                          likely easily explained.
                                          Hopefully this gets you well
                                          down the path to creating a
                                          stream with verses. <br>
                                          <br>
                                          I would NOT start from the PDF
                                          file, but from the pagemaker
                                          file.  The PDF almost
                                          certainly has a lot of text
                                          rearranging and extra
                                          characters like page numbers
                                          and running heads.  Pagemaker
                                          has the book text in a single
                                          stream, in a form that will
                                          convert to unicode relatively
                                          easily. </div>
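                                        <div>One way to check that count,
                                          and to see where the 36 extra
                                          hits (1107 minus 1071) fall,
                                          assuming the stream has been
                                          copied out as plain text and
                                          that "-" and ";/" are matched
                                          literally. The function names
                                          are made up for this sketch:</div>
                                        <pre>
```python
import re

# Verse terminators as reported for Matthew; treat both as assumptions.
VERSE_END = re.compile(r";/|-")


def count_verse_ends(stream):
    """Count every occurrence of a verse terminator in the stream."""
    return len(VERSE_END.findall(stream))


def surplus(stream, expected):
    """Difference between terminators found and verses expected."""
    return count_verse_ends(stream) - expected


def contexts(stream, width=10):
    """Return (offset, preceding text plus terminator) for each hit,
    so that hits inside an introduction or heading can be spotted."""
    hits = []
    for match in VERSE_END.finditer(stream):
        start = max(0, match.start() - width)
        hits.append((match.start(), stream[start:match.end()]))
    return hits
```
</pre>
                                        <div>If the count of 1107 holds,
                                          <code>surplus(stream, 1071)</code>
                                          would come out as 36, and
                                          <code>contexts(stream)</code>
                                          would show whether those hits
                                          cluster in the book
                                          introduction.</div>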
                                        <div class="gmail_default"
                                          style="font-family:garamond,serif;font-size:large"><br>
                                        </div>
                                      </div>
                                    </div>
                                  </div>
                                </div>
                              </blockquote>
                              <div><br>
                              </div>
                              <div><br>
                              </div>
                              <br>
                              <fieldset
class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636mimeAttachmentHeader"></fieldset>
                              <pre class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-quote-pre">_______________________________________________
sword-devel mailing list: <a class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org" target="_blank" moz-do-not-send="true">sword-devel@crosswire.org</a>
<a class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel" target="_blank" moz-do-not-send="true">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
                            </blockquote>
                            <br>
                          </div>
                          </blockquote>
                      </div>
                    </blockquote>
                  </div>
                  <br>
                </blockquote>
                <br>
              </blockquote>
              <div><br>
              </div>
              <div><br>
              </div>
              <br>
            </blockquote>
            <br>
          </div>
          </blockquote>
      </div>
      <br>
    </blockquote>
    <br>
  </body>
</html>