[sword-devel] XML versions of Thayer's or Strongs?

DM Smith dmsmith555 at yahoo.com
Sat Mar 11 12:32:13 MST 2006


I don't have access to Thayer's, since it is no longer available on 
CrossWire. So I have to speak "theoretically"; hopefully you can find 
and fix the problems.

Sean wrote:
> Thanks, your detailed instructions and example (and a little puzzling 
> about how Java works, since i'm not a Java guy) produced some useful 
> results, as well as (of course!) a few more questions related to 
> running this with Thayer's.
>
> 1) there are various complaints: i'm not sure if they're significant
> org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring 
> unexpected entry in orthodoxy of sMinimumVersion
> org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty 
> entry in orthodoxy: CopyrightHolder=
> org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty 
> entry in orthodoxy: CopyrightDate=
> org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty 
> entry in orthodoxy: DistributionNotes=
> org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty 
> entry in rsv: CopyrightNotes=
> org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty 
> entry in rsv: CopyrightContactEmail=
> org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty 
> entry in rsv: DistributionNotes=
> org.crosswire.jsword.book.filter.thml.THMLFilter(INFO): Could not fix 
> it by cleaning tags: Illegal character or entity reference syntax.
>

JSword validates the conf files against what is expected or allowed. All 
of these "Ignoring" messages are informational and can be safely 
ignored. Most of them have been cleaned up and will disappear if you 
download a fresh copy of those modules from the CrossWire server.

The last message means that the input had characters that were out of 
range for the encoding in use. Sword supports only two encodings, CP1252 
(called Latin 1) and UTF-8. If the encoding is UTF-8, then the conf 
needs to say so; otherwise the input will be interpreted as CP1252.

If the module is in some other encoding, you will need to re-encode it 
into UTF-8.
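
A minimal sketch of that re-encoding step, in Python. The function name 
and the example source encoding (ISO-8859-7, a legacy Greek code page) 
are assumptions for illustration only; the actual encoding of the 
Thayer's source would have to be determined first.

```python
def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Return `data` as UTF-8 bytes.

    If the bytes already decode cleanly as UTF-8 they are returned
    unchanged; otherwise they are decoded with `source_encoding`
    (an assumption supplied by the caller) and re-encoded as UTF-8.
    """
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (plain ASCII also lands here)
    except UnicodeDecodeError:
        return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    legacy = "λόγος".encode("iso-8859-7")  # hypothetical legacy input
    print(reencode_to_utf8(legacy, "iso-8859-7").decode("utf-8"))
```

After re-encoding, the conf would also have to declare UTF-8, as 
described above.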

> 2) the results from Thayer's seem to have lost the Greek characters. 
> What's in the .imp file looks like some 8-bit chars
> ωφελιμος
> which i assume is some kind of representation of the Greek characters 
> (haven't quite figured out what: doesn't seem to be UTF-8). But this 
> winds up in the output as a string of '?'s.

When you see a ? or a box in the output, you should verify that you are 
using a Unicode font, or at least one that contains the Unicode 
characters in the range that interests you.
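
It may also be worth ruling out an encoding mismatch. The 8-bit string 
quoted above is exactly what the UTF-8 bytes of the Greek word ωφελιμος 
look like when they are displayed as CP1252, which would suggest that the 
.imp data is UTF-8 after all. The sketch below (Python, hypothetical 
function names) reproduces the effect and reverses it; it only covers 
this one kind of mix-up.

```python
def show_as_cp1252(text: str) -> str:
    """Simulate a viewer that reads UTF-8 bytes as if they were CP1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_cp1252_misreading(garbled: str) -> str:
    """Reverse the effect: recover the bytes, then decode them as UTF-8.

    Raises UnicodeError if `garbled` is not UTF-8 text misread as CP1252.
    """
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    garbled = show_as_cp1252("ωφελιμος")
    print(garbled)                          # the 8-bit look-alike string
    print(undo_cp1252_misreading(garbled))  # the original Greek
```

If the round trip works on a sample copied from the .imp file, the file 
is most likely UTF-8 and the conf should say so.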

>
> 3) entry 5207 (huios) produces bad XML: looks like a TDNT reference 
> attribute in a sync tag doesn't get its terminating quote (after 
> "8:400"?) and slash+angle bracket ending the sync are also missing:
> AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 
> (<sync type="TDNT" value="8:400, 1210), Son of God
> The fault seems to exist in the .imp file as well (which has these 
> <sync> tags embedded)

JSword assumes that the module is good ThML in the first place. If this 
is not the case, the source will have to be fixed and the module 
re-created.
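
To find such entries before re-creating the module, something like the 
following Python sketch could scan the .imp text for <sync> tags that are 
not properly closed. The pattern is an assumption based only on the tag 
shape shown in the quoted sample (double-quoted attributes, self-closing 
tag); it is not a ThML validator.

```python
import re
from typing import List

# A well-formed tag as seen in the sample: <sync type="..." value="..." />
WELL_FORMED_SYNC = re.compile(r'<sync(?:\s+\w+="[^"<>]*")+\s*/>')


def find_broken_sync_tags(text: str) -> List[int]:
    """Return the offset of every '<sync' that does not begin a
    well-formed, self-closed tag."""
    return [
        m.start()
        for m in re.finditer(r"<sync\b", text)
        if not WELL_FORMED_SYNC.match(text, m.start())
    ]


if __name__ == "__main__":
    entry = ('AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 '
             '(<sync type="TDNT" value="8:400, 1210), Son of God')
    for offset in find_broken_sync_tags(entry):
        print(offset, entry[offset:offset + 40])
```

Run against the line from entry 5207, this flags only the TDNT tag.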

>
> 4) there are a number of bare "&" characters in the original which 
> seem to get dropped in the output instead of replaced with &amp; 
> (except for one in #5207, one might suppose because of the 
> unterminated attribute/tag issue)

If the original is ThML and the & characters are not escaped as &amp;, 
they will need to be fixed in the original.
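
One possible way to make that fix, sketched in Python: escape every & 
that does not already start an entity or character reference. The 
function name is hypothetical, and the pattern treats anything of the 
form &name; as an existing entity, so it should be checked against the 
actual source before being trusted.

```python
import re

# An ampersand NOT followed by "name;", "#123;" or "#x1F;"
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text: str) -> str:
    """Replace bare '&' with '&amp;', leaving existing references alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


if __name__ == "__main__":
    line = ('For Synonyms see entry <sync type="Strongs" value="G5811" /> & '
            '<sync type="Strongs" value="G5889" />')
    print(escape_bare_ampersands(line))
```

Because already-escaped references are skipped, running it twice over the 
same text does not double-escape anything.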

>
> 5) There are some issues with the synonym references around ampersands 
> (whether related to #4 i can't tell): the .imp file has
> For Synonyms see entry <sync type="Strongs" value="G5811" /> & <sync 
> type="Strongs" value="G5889" />
> but the OSISified output has
> <w lemma='strong:G5811'>
> For Synonyms see entry </w><w lemma='strong:G5889'> </w>
>
> Hope this feedback is helpful, and thanks again for the pointers. 
> Unless there's a solution to the problem with the Greek characters, 
> i'll have to fall back to parsing the .imp file by hand, since getting 
> these out is important to me. By the way, what displays in the Sword 
> Project for Thayer's lacks accents and breathing marks, though by 
> comparison i see them in e-Sword's version: anyone happen to know why?
>
> His,
> Sean


