Module Tools / MODTOOLS-41

Update usfm2osis.py to cover the new Paratext feature of nested tags using \+

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: usfm2osis.py
    • Labels:
      None
    • Environment:

      N/A

      Description

      The latest version of Paratext supports nested tags, which are identified by the prefix

       \+ 

      Most of these occur in footnotes.

      Here's an example from the Welsh beibl.net translation.

       \v 13 Ond cyn mynd, galwodd ddeg o'i weision ato a rhannu swm o arian\f + \fr 19:13 \fq swm o arian: \ft Groeg, “10 \+tl mina\+tl*”. Roedd un \+tl mina\+tl* yn werth 100 denariws, sef cyflog tua tri mis.\f* rhyngddyn nhw. ‘Defnyddiwch yr arian yma i farchnata ar fy rhan, nes do i yn ôl adre,’ meddai. 

      I've asked Jeff Klassen to let me know when this enhancement will be covered by updated USFM documentation.
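
A minimal sketch of how the nested spans in the example above might be located, in case it helps whoever picks this up. It is not taken from usfm2osis.py; the names NESTED_SPAN and find_nested_spans are hypothetical, and the pattern assumes a nested marker is written as \+ followed by the marker name, a space, the content, and a matching \+name* closer, as in the beibl.net sample.

```python
import re

# Hypothetical pattern: matches a nested character style such as \+tl mina\+tl*
# Group 1 is the marker name, group 2 is the enclosed text.
NESTED_SPAN = re.compile(r'\\\+([a-z0-9]+)\s(.*?)\\\+\1\*')

def find_nested_spans(usfm_text):
    """Return (marker, content) pairs for each nested span found in the text."""
    return NESTED_SPAN.findall(usfm_text)

sample = r'\ft Groeg, “10 \+tl mina\+tl*”. Roedd un \+tl mina\+tl* yn werth 100 denariws'
print(find_nested_spans(sample))  # [('tl', 'mina'), ('tl', 'mina')]
```

The pattern only handles one level of nesting and spans that stay on a single line, which is all the sample shows.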

      1. merged.footnotes.txt
        180 kB
        David Haslam
      2. merged.sfm.tags.count.txt
        3 kB
        David Haslam
      3. merged.usfm.tags.count.txt
        3 kB
        David Haslam

        Activity

        David Haslam added a comment - - edited

        The OSIS XML file for the Welsh beibl.net made using usfm2osis.py contains 722 matches to "+".

        It fails an XML syntax check because of another issue, to be reported separately.

        David Haslam added a comment - - edited

        Just attached the USFM Tag Statistics for the latest files.

        This may differ slightly from the stats for 2013.

        David Haslam added a comment - - edited

        I have made a copy of the SFM files in which each "+" was replaced by "\". (I will post this to the box folder.)

        Running usfm2osis.py on these files did not report any unhandled USFM tags, but the XML file fails a syntax check at line 5558.

        It looks like the syntax failure may be related to the processing of USFM tables.

        Line 5558 is part of Numbers 2:25-31 which is laid out as a table in the USFM.

        I have therefore created a new issue for this discovery. See

        http://www.crosswire.org/tracker/browse/MODTOOLS-82
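
The workaround described in this comment (producing a copy of the SFM files with the nesting prefix removed) could be sketched roughly as follows. This assumes the intent was to turn each \+marker into a plain \marker; the function name flatten_nested_markers is hypothetical and the code is not part of usfm2osis.py.

```python
import re

# Hypothetical helper: rewrite \+tl ... \+tl* as \tl ... \tl*
NESTED_MARKER = re.compile(r'\\\+([a-z0-9]+\*?)')

def flatten_nested_markers(usfm_text):
    """Drop the '+' from nested markers so they look like ordinary character markers."""
    # In the replacement, '\\' is a literal backslash and '\1' is the marker name.
    return NESTED_MARKER.sub(r'\\\1', usfm_text)

print(flatten_nested_markers(r'Groeg, “10 \+tl mina\+tl*”.'))
```

Flattening discards the information that the marker was nested, so it may change how the enclosing style is interpreted; it is only a stopgap for checking the rest of the conversion.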

        David Haslam added a comment -

        Attaching a text file containing all the USFM footnotes extracted from the Welsh beibl.net translation.

        This is where many of the character level nested tags are found.

        David Haslam added a comment -

        Apart from one instance of "+nd_+nd*", all the nested tags in the beibl.net translation are located within footnotes.

        Count  SFM tag  Description
        -----  -------  -----------
        00004  +it      Italics text style begin (nested)
        00004  +it*     Italics text style end (nested)
        00025  +nd      Name of deity begin (nested)
        00025  +nd*     Name of deity end (nested)
        00003  +qt      Quoted text begin (nested)
        00003  +qt*     Quoted text end (nested)
        00206  +sc      Small-cap text begin (nested)
        00206  +sc*     Small-cap text end (nested)
        00122  +tl      Transliterated (or foreign) word[s] begin (nested)
        00122  +tl*     Transliterated (or foreign) word[s] end (nested)
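
A tally like the one above could be reproduced with something along these lines. This is only a sketch and not the tool that generated the attached statistics; count_nested_tags and format_counts are hypothetical names, and the pattern assumes nested markers always begin with \+.

```python
import re
from collections import Counter

# Hypothetical pattern: captures '+tl', '+tl*', '+nd', ... without the backslash
NESTED_TAG = re.compile(r'\\(\+[a-z0-9]+\*?)')

def count_nested_tags(usfm_text):
    """Count each nested begin and end marker in the text."""
    return Counter(NESTED_TAG.findall(usfm_text))

def format_counts(counts):
    """Render counts as zero-padded lines sorted by tag name."""
    return [f"{counts[tag]:05d}  {tag}" for tag in sorted(counts)]

sample = r'“10 \+tl mina\+tl*”. Roedd un \+tl mina\+tl* yn werth 100 denariws'
for line in format_counts(count_nested_tags(sample)):
    print(line)
```

Comparing begin and end counts per marker is a quick way to spot unbalanced nested tags before running the conversion.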


          People

          • Assignee:
            Chris Little
            Reporter:
            David Haslam
          • Votes:
            0
            Watchers:
            3
