[sword-devel] XML versions of Thayer's or Strongs?

Sean sean at semanticbible.com
Sat Mar 11 10:51:14 MST 2006


Thanks, your detailed instructions and example (and a little puzzling 
about how Java works, since i'm not a Java guy) produced some useful 
results, as well as (of course!) a few more questions related to running 
this with Thayer's.

1) there are various complaints: i'm not sure if they're significant
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring unexpected entry in orthodoxy of sMinimumVersion
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: CopyrightHolder=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: CopyrightDate=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: DistributionNotes=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: CopyrightNotes=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: CopyrightContactEmail=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: DistributionNotes=
org.crosswire.jsword.book.filter.thml.THMLFilter(INFO): Could not fix it by cleaning tags: Illegal character or entity reference syntax.

2) the results from Thayer's seem to have lost the Greek characters. 
What's in the .imp file looks like some 8-bit chars
ωφελιμος
which i assume is some kind of representation of the Greek characters 
(haven't quite figured out what: doesn't seem to be UTF-8). But this 
winds up in the output as a string of '?'s.
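
For what it's worth, the sample word above looks consistent with UTF-8 bytes that were displayed as Windows-1252 (cp1252), which would explain the "8-bit" appearance. A minimal sketch to test that guess in Python; fix_mojibake is just an illustrative name, and whether the whole .imp file behaves like this one word is an assumption:

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this particular kind of damage; leave the text alone.
        return text


# The string as it shows up when viewing the .imp file.
sample = "ωφελιμος"
print(fix_mojibake(sample))
```

If the guess holds, the printed word should be ωφελιμος, and the file itself can simply be read as UTF-8 with no repair step at all.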

3) entry 5207 (huios) produces bad XML: looks like a TDNT reference 
attribute in a sync tag doesn't get its terminating quote (after 
"8:400"?) and slash+angle bracket ending the sync are also missing:
AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 (<sync type="TDNT" value="8:400, 1210), Son of God
The fault seems to exist in the .imp file as well (which has these 
<sync> tags embedded)
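
To see whether other entries have the same kind of damage, something like the following could scan the .imp text for sync tags that never close. This is only a sketch: it assumes every good tag has the shape seen in the samples here (type first, then value, then "/>"), and find_broken_sync is a made-up helper name.

```python
import re
from typing import List

# A well-formed tag as it appears in the samples:
#   <sync type="..." value="..." />
SYNC_OK = re.compile(r'<sync\s+type="[^"<>]*"\s+value="[^"<>]*"\s*/>')
SYNC_OPEN = re.compile(r"<sync\b")


def find_broken_sync(line: str) -> List[int]:
    """Return offsets of '<sync' openings that are not complete, self-closed tags."""
    return [
        m.start()
        for m in SYNC_OPEN.finditer(line)
        if not SYNC_OK.match(line, m.start())
    ]


# The fragment from entry 5207 quoted above.
entry = (
    'AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 '
    '(<sync type="TDNT" value="8:400, 1210), Son of God'
)
for offset in find_broken_sync(entry):
    print("unterminated sync tag at offset", offset)
```

On that fragment it flags only the TDNT tag and leaves the Strongs one alone.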

4) there are a number of bare "&" characters in the original which seem 
to get dropped in the output instead of replaced with &amp; (except for 
one in #5207, one might suppose because of the unterminated 
attribute/tag issue)
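
One possible workaround is to escape the bare ampersands in the text before it reaches any XML tooling. A rough sketch, assuming that anything already shaped like an entity or character reference should be left untouched (escape_bare_ampersands is an illustrative name, not part of JSword or Sword):

```python
import re

# An "&" that does NOT start something like &amp; or &#945; or &#x3b1;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace stray '&' characters with '&amp;', keeping existing references."""
    return BARE_AMP.sub("&amp;", text)


print(escape_bare_ampersands("Liddell & Scott, see &amp; compare"))
```

The check is purely syntactic, so text that merely resembles a reference (a word directly after "&" and followed by ";") is passed through unchanged.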

5) There are some issues with the synonym references around ampersands 
(whether related to #4 i can't tell): the .imp file has
For Synonyms see entry <sync type="Strongs" value="G5811" /> & <sync type="Strongs" value="G5889" />
but the OSISified output has
<w lemma='strong:G5811'>
For Synonyms see entry </w><w lemma='strong:G5889'> </w>

Hope this feedback is helpful, and thanks again for the pointers. Unless 
there's a solution to the problem with the Greek characters, i'll have 
to fall back to parsing the .imp file by hand, since getting these out 
is important to me. By the way, what displays in the Sword Project for 
Thayer's lacks accents and breathing marks, though by comparison i see 
them in e-Sword's version: anyone happen to know why?
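
For the parse-by-hand fallback, here is roughly what i have in mind. It is a sketch resting on the assumption that each entry in the .imp file starts with a line of the form $$$key and runs until the next such line; parse_imp and the file name in the comment are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def parse_imp(lines: Iterable[str]) -> Dict[str, str]:
    """Split .imp-style text into {key: entry body}, assuming '$$$key' header lines."""
    entries: Dict[str, str] = {}
    key: Optional[str] = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$"):
            if key is not None:
                entries[key] = "\n".join(body).strip()
            key = line[3:].strip()
            body = []
        elif key is not None:
            body.append(line)
    if key is not None:
        # Flush the final entry, which has no following header line.
        entries[key] = "\n".join(body).strip()
    return entries


# Hypothetical usage, with the file name and encoding as assumptions:
#   with open("thayer.imp", encoding="utf-8") as f:
#       entries = parse_imp(f)
demo = ["$$$05207\n", "huios\n", "AV-son(s) 85\n", "$$$05208\n", "hule\n"]
print(parse_imp(demo))
```

Anything before the first $$$ line is ignored, and the embedded sync tags are left in the entry bodies for a later pass.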

His,
Sean


