[sword-devel] Spelling (was Versification/Encoding issues)

Mike Hart just_mike_y at yahoo.com
Thu Jan 8 15:58:44 MST 2009

On issue 4, spelling:

I've taken everyone's advice on spelling to heart, and I will try to remain true to the original text copy.

> As for spelling, and as a fascinating learning experience, pick up your
> printed KJV Bible and examine the spelling of the word "ankle[s]" in Ezekiel
> 47:3 and Acts 3:7.  Some editions have "ancle", others have "ankle".
> Ostensibly both streams are based on the Authorised Version of 1769.  So
> Peter's advice is spot on.
> -- David

That's interesting, because ancle is one of the words I corrected in JSFB -- the OCR had ancle, but the PDF itself, my paper KJV copy, and my JPS complete Tanach (individual volumes) had ankle.  I can't say what verse it was; at the time I was hunting for e's that had been OCR'd into c's (in kwrite, a 'regular expression' search for [bcdfghjklmnpqrstvwxy]c[bcdfgjklmnpqrstvwx]).
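For anyone who wants to run the same hunt outside kwrite, here is a minimal Python sketch of that search. The pattern is the one quoted above; the function name, the lowercasing step, and the sample sentence are my own assumptions for illustration, not part of the actual JSFB workflow.

```python
import re

# Pattern from the kwrite search: a 'c' between two consonants is a
# likely misread 'e'. The second class leaves out 'h' and 'y', so
# ordinary "ch" and "cy" words are not flagged.
SUSPECT_C = re.compile(r"[bcdfghjklmnpqrstvwxy]c[bcdfgjklmnpqrstvwx]")

def find_suspect_words(text):
    """Return the words containing a consonant-c-consonant run."""
    words = re.findall(r"[A-Za-z]+", text)
    # Lowercasing is an assumption; the pattern itself is lowercase only.
    return [w for w in words if SUSPECT_C.search(w.lower())]

# Made-up sample line, not taken from the scans.
print(find_suspect_words("the watcrs were to the ancles"))
```

Note that "ancles" is flagged along with the real misread "watcrs", so the hits are only candidates to check against the page image, not automatic corrections.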

On the subject, but from an opposing view: if you look at the 1611 text of the KJV, you'll note that some ~50% of the words are spelled differently from what we call the "King James Version" today, but it doesn't really seem to matter. Read, for example, the 23rd psalm. It is still (or originally) the same as what we know and memorize in Sunday school at age 9, regardless of the spelling. I don't remember the spelling when I recite.

KJV1611 23rd pfalme

(There's a zoom button in the upper left margin; it is readable at 50%.) (**) See further note below.

Since the 1769 version is still called the "King James" and they both read largely the same, I'd say the spelling is not as important as the word (as pronounced). And even then, a good number of words have been 'updated' from the 1611 copy in the 1769 'true' KJV. 

That said, if you look at the quality of the Jewish School and Family Bible scans, you will see that I'm up against a mammoth task just getting a readable text, much less one that is letter-exact. About 10%-20% of the words were misinterpreted by the OCR. I've managed to reverse engineer the OCR process and repair the meaning of most words. That is, an OCR interprets the same font the same way most of the time, so what may appear to be gibberish in the OCR output can be repaired by careful examination of the OCR errors. For example, in JSFB, the italicized words are generally simple short modifier words: the, of, to, etc.  The OCR did poorly at interpreting these words, but it was fairly repeatable in how it misread them ("of" turned into o/* or o/' or o/").  I've done countless search-and-replace passes for things like V/ -> W to restore the characters to readable text. What I've got now matches the PDF in 95+% of my random checks, and most of the remaining mismatches are missing letters and punctuation. (And no, I'm not trying to keep italicized words: plain text only.)
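Because the misreads are repeatable, the repair amounts to a lookup table applied over the whole text. A minimal sketch in Python, using only the substitutions mentioned above; the table name, the function name, and the sample string are hypothetical, and a real table would be far longer.

```python
# Hypothetical substitution table built from the misreads described
# above: each entry maps an OCR artifact to the intended text.
OCR_FIXES = [
    ("o/*", "of"),
    ("o/'", "of"),
    ('o/"', "of"),
    ("V/", "W"),
]

def apply_fixes(text, fixes=OCR_FIXES):
    """Apply each literal search-and-replace pair in order."""
    for wrong, right in fixes:
        text = text.replace(wrong, right)
    return text

# Made-up sample string, not taken from the scans.
print(apply_fixes("the V/ord o/* the Eternal"))
```

Plain string replacement is used rather than regular expressions so that characters like * and / need no escaping.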

Additionally, in the JSFB, verses are marked in the margins only. I am restoring the verse indicators to the verse divisions. In volume 1 this is easy, because the verse divisions appear as asterisks. (Don't ask me why; I don't see any divisions in the PDF, but they are there in the 2nd copy of volume 1 on the archive: http://www.archive.org/details/schoolfamilybibl01beni ) In the other volumes, the verse division is generally the nearest punctuation mark, but not always. The "not always" part gets tricky. I'm referring to the JSFB PDF, a hardcopy KJV, and a JPS new Tanach to see where each division falls.
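For the easy volume 1 case, the idea can be sketched in a few lines of Python. This assumes the text of one chapter has already been isolated and that every asterisk marks a verse boundary; the function name and the placeholder text are made up for illustration.

```python
def number_verses(chapter_text, marker="*"):
    """Split a chapter on the marker and number the non-empty pieces."""
    pieces = [p.strip() for p in chapter_text.split(marker)]
    pieces = [p for p in pieces if p]  # drop empty pieces around stray markers
    return list(enumerate(pieces, start=1))

# Placeholder text standing in for one chapter of OCR output.
for number, verse in number_verses("first verse text * second verse text * third"):
    print(number, verse)
```

The other volumes would not work this way, since the nearest punctuation mark is only usually the verse division and each one still needs checking by eye.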

Additionally, the JSFB has copious footnotes on each page (on average 10 notes a page). I'm unable to devise a capture technique for the notes on this revision, so these are being tossed. The footnote markers present a further problem, in that they mess with the word they're attached to.

After all these issues, I will never be able, by myself, to certify the correct spelling of each word from this witness, and that isn't my intention, because there is so much more to do. I'm semi-dyslexic anyway, so editing would never be my strong point. This work has a different (unique to me, anyway) approach to translation (it uses "The Eternal" for the Tetragrammaton, for example) that seems interesting enough to study, and I study in bibletime or bible desktop, so I want it there.

The years 2002-2008 were explosive for online texts. Over 1 million books now reside at the Internet Archive alone, and Google's is a bigger (but more recent) operation. However, the bubble is over. The rate of books going online will drop significantly due to Microsoft dropping its program, and Google settling the lawsuits brought against it by the publishing industry.

It is my belief that these texts (especially Judaeo-Christian texts) may not always be readily available online, so there is a limited window while they are being offered for free download to snag what you can. Also, there are many areas of the world where 'bibles' cannot be accessed online.  

When I first started looking for downloadable bibles in 2001, the "universal library" ( http://www.ulib.org -- Carnegie Mellon University ) had some bibles on it (way more than I could fit on my huge 2G hard drive at the time).  If you go search there now, they are largely missing. For the search "BIBLE", some 500 listings come up, but try to actually view one. I don't see this as omission, but censorship.  Do the same search on Google (130,000 full view books) and the Internet Archive (11,000 texts), and note the difference in quantity of search results and availability of the texts.

In years to come, with more people involved, adding footnotes, italics and certifying the spelling may be warranted. For now, making the words of the text itself available for study is my intention.  I have a very long list of texts to work on, so I won't be 'perfecting' any, but 'improving' many. 

(**) KJV1611 http://www.archive.org/details/holybiblefacsimi00polluoft
(Completely off-topic: the notes in this witness are OCR'd as separate columns, meaning the text files from this work may be a good candidate for a module. I think I put this on the module request list, but it was removed... a 190 page preface does tend to obscure the facsimile behind it. The old gothic characters make for a 95% OCR error rate, but there is a good chance that the OCR can be corrected through error analysis.)

