Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Component/s: Encoding problem
Labels: None
Description
Example: (diatheke output)
Joshua 5:6: Διοτι τεσσαρακοντα ετη περιηρχοντο οι υιοι Ισραηλ εν τη ερημω, εωσου ετελευτησαν πας ο λαος, οι ανδρες του πολεμου, οι εξελθοντες εξ Αιγυπτου, επειδη δεν υπηκουσαν εις την φωνην του Κυριου προς τους οποιους ο Κυριος ωμοσεν, οτι δεν θελει αφησει αυτους να ιδωσι την γην, την οποιαν ωμοσεν ο Κυριος προς τους πατερας αυτων οτι θελει δωσει εις ημας, γην ퟻ�ࠏƏĎҎݠγαλα και μελι.
These encoding errors may have occurred during one of these changes:
History_1.6=converted to UTF-8 and compressed
History_1.5=Changed to Symbol font encoding
- gkmod.raw.imp.extra.from.phlp.txt (09/Feb/13 8:27 AM, 4 kB, David Haslam)
- gkmod.raw.imp.extra.google.translate.txt (09/Feb/13 8:15 AM, 2 kB, David Haslam)
- gkmod.raw.imp.extra.txt (09/Feb/13 8:06 AM, 4 kB, David Haslam)
- UMGreek.diatheke.character.frequency.txt (08/Feb/13 3:06 AM, 5 kB, David Haslam)
Activity
The diatheke observations are confirmed by examining the affected locations in Xiphos.
FIO. I still have a copy of the ThML file from CCEL dated 2002-12-31.
NB. They obtained the text from the Unbound Bible Project.
This text does not have these encoding problems.
Upon request, I can readily send a copy to our chief module maker.
umgreek.conf includes:
SwordVersionDate=2002-01-01
Looking at the 2002 dates, it's quite conceivable that Unbound Bible scraped our own earliest text to make theirs.
If we've not retained an archive version of our 2002 text from before we managed to introduce these encoding errors, that would imply that it may be legitimate (under our text sourcing policy) to recover the correct text from that ThML file or even directly from the Unbound Bible project.
btw. The aforementioned ThML file has 31118 verses, which indicates that it is Alternate Versification.
cf. KJV has 31102 verses. Difference = 16. Need to determine where the extra verses are.
cf. The UMGreek module is default v11n.
UMGreek also has some misplaced verse text.
e.g. Just above John 1:1 it displays this (in Xiphos):
Και αφου υπεστρεψαν ητοιμασαν αρωματα και μυρα. Και το μεν σαββατον ησυχασαν κατα την εντολην.
Google translates:
And when they returned had prepared spices and ointment. And while I calmed down during the Sabbath commandment.
Which corresponds loosely to the last verse in Luke 23.
And they returned, and prepared spices and ointments; and rested the sabbath day according to the commandment.
(Luke 23:56 [KJV])
cf.
Και αφου υπεστρεψαν ητοιμασαν αρωματα και μυρα. Και το μεν σαββατον ησυχασαν κατα την εντολην.
(Luke 23:56 [UMGreek])
This is sample evidence that the original build had some serious structural problems!
As part of my work towards solving this issue, I've converted the ThML file to IMP format (seeing as ThML is deprecated by CrossWire for new module builds), and I just made a new module, temporarily named GkMod.
This suffers from none of the issues noted above for the UMGreek module.
I'll just build from Unbound Bible's files. We have a well-tested converter from their format, and contrary to their statements, their files are not derived from ours. (I've got files with a timestamp from 2000 that demonstrate their UTF-8 files predate ours. CCEL's files are also fairly clearly derived from Unbound Bible's (or an intermediary source)--plus omissions and "copyright".)
Excellent!
It was entirely providential that I happened to come across these issues this morning.
I'd just selected UMGreek to test a Windows CMD script I'd written to save me a little time in future.
It wasn't something I'd set out to look for when I got up this morning.
Yet all in all, a satisfying outcome, with a solution to look forward to presently.
It's probably time to rename this module to something more in line with our other modules, i.e. Gre{Translator}. I haven't been able to figure out the translator, however. Wikipedia identifies 3 Modern Greek Bible translations. Two of those appear to be NT-only. The third is Vamvas' translation, which we have elsewhere. Biblos.com claims this translation is Vamvas', but it is clearly not. (Original Vamvas text: https://babel.hathitrust.org/shcgi/pt?id=nnc1.0046038183;seq=7;view=1up )
The difference of 16 in the verse count in file gkmod.thml is more likely to be due to a duplication of a 16-verse chapter than to be a symptom of av11n. No appending warnings were logged when I made a module.
This observation is recorded just in case the same problem may also be present in the Unbound Bible project source text.
I have yet to home in on where these 16 extra verses are.
osis2mod does not keep track of the verses it has seen. When it sees a verse (osisID), it checks to see if it is in the versification. If not, it'll get the best prior verse to append it to. But if it is in the versification, it merely appends the content to the "dat" file and notes its location in the index. So if the index already had a value for it, the verse is in the dat file twice, but the second index entry overwrites the first.
One way to find this is to build a raw text module (no compression flag) and turn on the debug feature with -d 2. This will put in milestone markers for each verse start and end of the form:
<milestone resp="v" .../> where ... are all the attributes copied in from the milestoned version of the verse tag.
You can then look for duplicates of these milestones.
I added this feature so that I can quickly find problems. No need for mod2imp. The dat file has everything needed. And the SWORD/JSword programs will ignore the milestone. So you can look at the results in a front-end.
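A minimal sketch of the duplicate search described above, assuming the raw dat file can be read as bytes and that a verse written twice yields two byte-identical milestone tags. The function name duplicate_milestones is hypothetical, and no attribute names are assumed: the whole tag is compared as a string.

```python
import re
from collections import Counter

# Matches any self-closing <milestone .../> tag that carries resp="v",
# whatever other attributes were copied in from the verse tag.
MILESTONE_RE = re.compile(rb'<milestone\b[^>]*\bresp="v"[^>]*/>')


def duplicate_milestones(data: bytes) -> dict:
    """Return {milestone tag: count} for tags that occur more than once."""
    counts = Counter(
        m.group(0).decode("utf-8", "replace")
        for m in MILESTONE_RE.finditer(data)
    )
    return {tag: n for tag, n in counts.items() if n > 1}
```

To use it, read the module's dat file in binary mode and pass the bytes to duplicate_milestones; any tag it returns points at a verse that was written more than once.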
Not what I expected earlier!
The 16 extra verses in gkmod appear at the end of the first 3 chapters of Matthew.
The three extra passages are: Matt 1:26-30, 2:24-30, 3:18-21 and these are outside the KJV v11n.
The three extra passages in gkmod are in the attached text file.
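The check for verses beyond the KJV v11n could be sketched as below. This assumes an IMP file whose key lines start with $$$ and end in chapter:verse, and that the book is spelled "Matthew" in the keys (an assumption; adjust it to the actual file). The table holds only the KJV verse counts for Matthew 1-3, and the names KJV_MAX and out_of_range_keys are hypothetical.

```python
import re

# IMP key line, e.g. "$$$Matthew 1:26" (the book spelling is an assumption).
KEY_RE = re.compile(r'^\$\$\$(.+?)\s+(\d+):(\d+)\s*$')

# Partial table for illustration only: KJV verse counts for Matthew 1-3.
KJV_MAX = {("Matthew", 1): 25, ("Matthew", 2): 23, ("Matthew", 3): 17}


def out_of_range_keys(lines, max_verses=KJV_MAX):
    """Return (book, chapter, verse) for keys beyond the chapter's last verse."""
    extra = []
    for line in lines:
        m = KEY_RE.match(line)
        if not m:
            continue  # verse text, not a key line
        book, chapter, verse = m.group(1), int(m.group(2)), int(m.group(3))
        limit = max_verses.get((book, chapter))
        if limit is not None and verse > limit:
            extra.append((book, chapter, verse))
    return extra
```

The three ranges account for the whole difference: 5 verses (1:26-30) + 7 verses (2:24-30) + 4 verses (3:18-21) = 16.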
If you paste each passage into Google translate, it becomes clear that they are from elsewhere in the NT.
Looks like they might be undetected copy & paste errors!
Attached file = Google translate for the extra passages in gkmod.
Despite the very imperfect translation, where they come from in the Epistles is recognizable.
They are copies of the passages in Philippians 1,2,3 with the same verse references,
albeit with some trivial instances of multiple spaces between words.
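A comparison that ignores the multiple-space differences might look like this minimal sketch. The helper names are hypothetical; the two verse texts would come from the attached files.

```python
def normalize_spaces(text: str) -> str:
    """Collapse every run of whitespace to a single space and trim the ends."""
    return " ".join(text.split())


def same_verse_text(a: str, b: str) -> bool:
    """True when two verse texts differ only in their whitespace."""
    return normalize_spaces(a) == normalize_spaces(b)
```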
Thanks DM - I hadn't spotted your comment prior to investigating where the 16 verses were found.
"No need for mod2imp." was not really the case, as I'd made my provisional comparison module using imp2vs.
And for that matter, I'd never converted gkmod.thml from ThML to OSIS.
Attached text file is where the extra passages must have been copied from in Philippians. Mystery solved!
As it happens, the Go Bible edition I'd made was from gkmod.thml (obtained before I joined CrossWire).
I can therefore edit that file to remove the spurious 16 verses and then rebuild that particular Go Bible.
cf. Go Bible Creator doesn't care about v11n.
The Modern Greek Go Bible edition has been rebuilt with the required corrections.
The Unbound Bible edition of the Modern Greek translation did NOT suffer the extra 16 verses error.
It has two particular oddities though.
(a) The use of [] to tag the start of new paragraphs
(b) The use of << ... >> to wrap the canonical Psalm titles and the 22 stanza titles in Ps.119.
The latter conflicts with XML markup, so the "well-tested" converter needs to be able to handle these.
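One way a converter might handle both oddities before emitting XML is sketched below. It assumes that [] marks a paragraph start (unconfirmed) and that << ... >> wraps a title; the function name preprocess and the returned dictionary layout are hypothetical and not taken from any existing CrossWire tool.

```python
import re
from xml.sax.saxutils import escape

# A title wrapped as << ... >>, with optional padding inside the brackets.
TITLE_RE = re.compile(r'<<\s*(.*?)\s*>>')


def preprocess(verse_text: str) -> dict:
    """Separate titles and paragraph marks from the verse text.

    Titles and text are XML-escaped so that no stray angle bracket or
    ampersand reaches the generated markup.
    """
    titles = TITLE_RE.findall(verse_text)
    body = TITLE_RE.sub(" ", verse_text)
    new_paragraph = "[]" in body  # assumption: [] means "new paragraph"
    body = " ".join(body.replace("[]", " ").split())
    return {
        "titles": [escape(t) for t in titles],
        "new_paragraph": new_paragraph,
        "text": escape(body),
    }
```

Keeping the result as plain data leaves the choice of output markup to the converter, so a wrong guess about [] can be corrected later without re-parsing the source.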
Are you sure about [] being used to identify paragraphs? I had guessed that was their meaning myself, but I wasn't confident.
The total number of [] is 4258.
Without access to a printed edition or facsimile one cannot be 100% certain.
My next door neighbour is Greek. I could ask to see it if he owns a copy.
Even so, if we make the assumption that these are paragraph marks (or even a substitute for pilcrows),
should it turn out otherwise, little damage would have been done, and it would be a simple task to make
an updated module later.
When the diatheke output is opened with BabelPad, encoding errors are reported.
The 326 characters that show up as Unicode replacements U+FFFD are not the only encoding problems.
A character frequency tool finds more unexpected characters, including some private-use code points!
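A character frequency check of this kind can be sketched as follows. This is not the tool used for the attached report; it assumes that only ASCII and the two Greek Unicode blocks are expected in the diatheke output, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter


def unexpected_characters(text: str) -> Counter:
    """Count characters outside ASCII and the Greek blocks."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            continue  # ASCII letters, digits, punctuation, whitespace
        if 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF:
            continue  # Greek and Coptic, Greek Extended
        counts[ch] += 1
    return counts


def describe(ch: str) -> str:
    """Label a character, flagging U+FFFD and private-use code points."""
    cp = ord(ch)
    if cp == 0xFFFD:
        kind = "replacement character"
    elif unicodedata.category(ch) == "Co":
        kind = "private use"
    else:
        kind = unicodedata.name(ch, "unnamed")
    return f"U+{cp:04X} {kind}"
```

Running unexpected_characters over the whole diatheke output and printing describe(ch) with each count gives a report that can be compared against the BabelPad findings.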