[sword-devel] Re: Westcott-Hort

Sun, 04 Apr 2004 11:38:40 -0700

Costas,
	A few comments...

Costas Stergiou wrote:
> Hi David/Troy,
> looking at the texts, I think there is some work to be done:
> - remove any combining diacriticals & process everything as precomposed.

I think this is backwards.  From my limited understanding, and from 
reading recent posts on sword-devel by people with much more knowledge 
than me, the text should be stored fully decomposed, with no precomposed 
characters.  If the renderer needs to send precomposed characters to the 
display control, it can compose them at that point (I think sword can do 
this with an ICU filter).
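
To make the "store decomposed, compose only for display" idea concrete, 
here is a toy sketch.  It is not how sword or ICU does it: the name 
composeForDisplay and the seven-entry table are made up for illustration, 
and it only knows Greek vowels followed by U+0301 (combining acute).  Real 
code would hand the job to ICU's normalization support instead.

```cpp
#include <map>
#include <string>
#include <utility>

// Toy composition table: (base letter, combining mark) -> precomposed letter.
// Hypothetical and deliberately tiny: Greek vowels + U+0301 (combining acute).
static const std::map<std::pair<char32_t, char32_t>, char32_t> kCompose = {
    {{U'\u03B1', U'\u0301'}, U'\u03AC'},  // alpha   + acute
    {{U'\u03B5', U'\u0301'}, U'\u03AD'},  // epsilon + acute
    {{U'\u03B7', U'\u0301'}, U'\u03AE'},  // eta     + acute
    {{U'\u03B9', U'\u0301'}, U'\u03AF'},  // iota    + acute
    {{U'\u03BF', U'\u0301'}, U'\u03CC'},  // omicron + acute
    {{U'\u03C5', U'\u0301'}, U'\u03CD'},  // upsilon + acute
    {{U'\u03C9', U'\u0301'}, U'\u03CE'},  // omega   + acute
};

// Compose every base+mark pair found in the table; copy everything else
// through unchanged.  The stored text itself is never modified.
std::u32string composeForDisplay(const std::u32string &stored) {
    std::u32string out;
    for (char32_t c : stored) {
        if (!out.empty()) {
            auto hit = kCompose.find({out.back(), c});
            if (hit != kCompose.end()) {
                out.back() = hit->second;  // replace base with precomposed form
                continue;                  // and drop the combining mark
            }
        }
        out.push_back(c);
    }
    return out;
}
```

The point of the design is that the module keeps one canonical form, and 
each frontend decides for itself whether its display control needs 
precomposed characters.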

> - remove common mistakes found by text processing (e.g. wrong letters)
> - fix missing spaces between some words
> - compare words with other accented texts to find other errors, etc.
> 
> Right now I am working on an accented Greek text (a Byzantine one) which
> looks very, very good. It is supposed to be the official Greek text used by
> the Eastern Orthodox Church. Actually, it is very close to the Byzantine text
> (and at times agrees with the TR). I also have a printed version of it, and it
> does seem very good. I got it from http://kainh.homestead.com.
> 
> At the same time, I have been working on some other accented greek texts
> also.
> What I think is that by having all those accented texts, maybe I could write
> a util that takes almost any unaccented Greek text, looks at it verse by
> verse, and adds diacriticals by using the various accented versions I have.
> I am not sure that this is feasible, but when I look at the differences
> between the texts, I realize they are very small and most (if not all) of
> them can be found programmatically.
> 
> About the WH you sent me: I would like to test it through all the various
> scripts I have and make any corrections, taking my time.
> 
> One thing is important here:
> All the above can only happen on the texts WITHOUT the strong & morph tags.
> So, I suppose, we need to find a generic way of adding these later. I think
> it is not difficult, but since I don't know the specifics I cannot tell.

You can iterate thru the text without Strong's with a very simple routine:

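#include <iostream>
#include <swmgr.h>
#include <swmodule.h>
using namespace sword;
int main() {
	SWMgr swordLibrary;
	swordLibrary.setGlobalOption("Strong's Numbers", "Off");	// render plain text only
	SWModule *whac = swordLibrary.Modules["WHAC"];
	if (!whac) return 1;	// WHAC module is not installed
	for ((*whac) = TOP; !whac->Error(); (*whac)++)
		std::cout << whac->RenderText() << "\n";
}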

I would suggest using your scripts with the above code to find errors, 
then correcting each error in the module that still has the Strong's/morph 
tags.  You can export the module with mod2osis or mod2imp, whichever is 
easier for you to work with.  Then you can import it back with osis2mod or 
imp2vs.
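
If you would rather strip the tags from a text file yourself, a rough 
sketch for the byztxt-style lines David quotes below might look like this.  
I am assuming tokens are separated by whitespace, that Strong's and tense 
numbers are all-digit tokens, and that morph codes are wrapped in braces; 
stripTags and isTagToken are names I made up.  Verse references and variant 
markers such as "|", "[" and "]" are left untouched.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// True for tokens that look like tags rather than words: all-digit
// Strong's/tense numbers ("2036") or brace-wrapped morph codes ("{CONJ}").
static bool isTagToken(const std::string &tok) {
    if (tok.front() == '{' && tok.back() == '}') return true;
    for (unsigned char ch : tok)
        if (!std::isdigit(ch)) return false;
    return true;
}

// Keep only the non-tag tokens of one verse line, joined by single spaces.
std::string stripTags(const std::string &line) {
    std::istringstream in(line);
    std::string tok, out;
    while (in >> tok) {          // tokens from >> are never empty
        if (isTagToken(tok)) continue;
        if (!out.empty()) out += ' ';
        out += tok;
    }
    return out;
}
```

For example, stripTags("eipen 2036 5627 {V-2AAI-3S} de 1161 {CONJ}") 
gives "eipen de".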

	Thanks for all the work you guys are thinking about and doing!  I'm 
excited to see these resources excel!

	-Troy.

> What I can do, is the processing of the Greek texts (which is natural to
> me). I will be happy to collaborate on this with anyone else interested.
> 
> David: for now, I think there is nothing I would need from you, I still have
> to progress myself.  You also mentioned polycarp66. Who is he? Maybe he
> could help out also?
> 
> It would be good to post this to the sword list in case others are working
> on similar issue: Troy, you can do this if you think it would be beneficial.
> 
> With love in our Lord Jesus Christ,
> Costas
> 
> P.S. (maybe Chris should be reading this also since I think he is the module
> expert? not sure...)
> 
> 
> 
> 
> ----- Original Message -----
> From: <RDN12345@aol.com>
> To: <csterg@ece.ntua.gr>
> Sent: Saturday, April 03, 2004 8:08 PM
> Subject: Re: Westcott-Hort
> 
> 
> 
>>Costas,
>>
>>I wrote software that extracted the Strong's numbers from the unaccented W.H.
>>from byztxt.com and inserted them in the accented W.H. from CCEL.
>>The software examines both texts one verse at a time, creates a word list for
>>each, and compares the words, ignoring case and accents and paying attention
>>to the order of the words, while attempting to find a match for where a
>>Strong's number should be placed.
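
As I read that description, the matching step could be sketched roughly 
like this.  This is my guess at the idea, not David's actual code: 
TaggedWord and alignStrongs are invented names, and I assume both word 
lists have already been folded to a comparable form (no case, no accents) 
before they get here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct TaggedWord {
    std::string key;      // word folded to a comparable form
    std::string strongs;  // Strong's number attached to it, e.g. "2036"
};

// For each word of the accented verse (given as folded keys), return the
// Strong's number of the matching tagged word, or "" when none is found.
// The search for each word starts just after the previous match, so the
// word order of the verse is respected.
std::vector<std::string> alignStrongs(const std::vector<TaggedWord> &tagged,
                                      const std::vector<std::string> &accentedKeys) {
    std::vector<std::string> result(accentedKeys.size());
    std::size_t next = 0;  // first tagged word not yet consumed
    for (std::size_t i = 0; i < accentedKeys.size(); ++i) {
        for (std::size_t j = next; j < tagged.size(); ++j) {
            if (tagged[j].key == accentedKeys[i]) {
                result[i] = tagged[j].strongs;
                next = j + 1;
                break;
            }
        }
    }
    return result;
}
```

A greedy forward search like this can jump too far ahead when a common 
word repeats later in the verse, so limiting how far it may look would 
probably be wise for real texts.
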
>>Reinserting the Strong's numbers would probably be possible, but would require
>>some modification of the software, since it looks at the HTML-encoded Unicode
>>(&#XXX) in the pages from CCEL. The HTML files that I have may not be the same
>>as what is on CCEL; polycarp66 fixed some errors, missing text, etc. He sent all
>>of his corrections to CCEL, but I do not know if they have replaced their files
>>with his corrected ones. It has been a while since I worked on the W.H. I will
>>try to locate all of the current files. I do not know Greek, so all I can do is
>>fix character encodings, remove the extra spaces, etc.
>>It may be best to correct the HTML files (for any corrections that must be
>>done by hand), and then reprocess everything.
>>If needed, the HTML files could simply be converted to UTF-8 and the Strong's
>>numbers left out (if you do not need the Strong's numbers).
>>The text could be reprocessed with strong's numbers inserted for Troy.
>>
>>There were also some differences in versification between the texts.
>>
>>I have attached a zip file containing:
>>wc_a_.txt
>>The processed W.H. from byztxt.com (verse per line); it is encoded for the
>>OLBGreek font.
>>
>>This verse was removed because it appears to be completely enclosed in
>>variant markers, does not exist in the CCEL W.H., and I did not know what to do
>>with it.
>>
>>12:47 | | [eipen 2036 5627 {V-2AAI-3S} de 1161 {CONJ} tiv 5100 {X-NSM} autw
>>846 {P-DSM} idou 2400 5628 {V-2AAM-2S} h 3588 {T-NSF} mhthr 3384 {N-NSF} sou
>>4675 {P-2GS} kai 2532 {CONJ} oi 3588 {T-NPM} adelfoi 80 {N-NPM} sou 4675 {P-2GS}
>>exw 1854 {ADV} esthkasin 2476 5758 {V-RAI-3P} zhtountev 2212 5723 {V-PAP-NPM}
>>soi 4671 {P-2DS} lalhsai] 2980 5658 {V-AAN} |
>>
>>wh_b.txt
>>The text that is actually stripped from the CCEL html files with
>>versification modified to match the versification of the text from byztxt.com.
> 
>>I am not sure what needs to be done, so you will have to tell me what you
>>need me to do.
>>
>>David
>>
>>
>>
>>