[sword-devel] NFC and osis2mod

Chris Little chrislit at crosswire.org
Mon Feb 4 09:16:09 MST 2008


I've been meaning to work on this, but I thought I'd try to point you
in the right direction, since I'm pretty sure I won't have time in the
next couple of days. (I'll be at the UTC meeting at Apple, thinking about
all things Unicode but not working on them.) The bugs are almost
certainly in utf8nfc.cpp. To my knowledge it has never been used, so
it has never been debugged. It's also fairly old and should probably be
updated to a newer version of the ICU API. (The code could work with the
existing functions, but it might be best to update it using some of the
copious examples at ICU.)

Your code looks fine to me. My old utf8nfc.cpp code looks a mess.
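
Since the input is supposed to be NFC already, one way to narrow this
down might be to compare each verse's bytes before and after the filter
runs and report the first offset at which they differ. Below is a minimal
sketch using only the standard library. firstDifference is a hypothetical
helper of my own naming, not part of Sword, and it assumes the text has
been copied into a std::string on either side of the processText call.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Sword): returns the byte offset at which
// two buffers first differ, or std::string::npos if they are identical.
std::size_t firstDifference(const std::string &before, const std::string &after) {
    const std::size_t common = std::min(before.size(), after.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (before[i] != after[i]) return i;
    }
    // One buffer is a prefix of the other: they differ where the shorter ends.
    return before.size() == after.size() ? std::string::npos : common;
}
```

If the reported offset always falls at or near the end of the original
text, that would suggest a length or termination problem in the filter
rather than a problem in the normalization itself.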

--Chris


On Jan 31, 2008, at 5:19 PM, DM Smith wrote:

> Can someone offer some pointers as to what I am doing wrong?
>
> I am trying to add the ability to osis2mod to optionally ensure that a
> UTF-8 document is normalized to NFC.
>
> I added -n as a flag to indicate that normalization should occur and
> set a global boolean variable "normalize" to true iff the flag is
> present.
>
> Rather than reinventing the wheel, I figured Sword's UTF8NFC filter
> would be the ticket.
>
> First I added the header with:
>
> #ifdef _ICU_
> #include <utf8nfc.h>
> #endif
>
> And I created a global variable:
>
> #ifdef _ICU_
> UTF8NFC normalizer;
> #endif
>
>
> Then right before adding the entry I ran it through the filter:
>
> #ifdef _ICU_
> 			if (normalize) {
> 				// note the hack of 2 to mimic a real key. TODO: remove all hacks
> 				normalizer.processText(activeVerseText, (SWKey *)2);
> 			}
> #endif
>
> I then ran the KJV.xml from www.crosswire.org/~dmsmith/kjv2006 through
> osis2mod.
>
> Since I thought I had already normalized the text, I expected a diff
> to show nothing.
>
> However, I found corruption in Matthew 3:17 at the end of the raw text
> in the module (and in many places later).
>
> The corruption is always at the end of the line. Here is the raw text
> for that verse:
> <w lemma="strong:G3588" morph="robinson:T-NSM" src="13"></w><w
> lemma="strong:G2532" morph="robinson:CONJ" src="1">And</w> <w
> lemma="strong:G2400" morph="robinson:V-2AAM-2S" src="2">lo</w> <w
> lemma="strong:G5456" morph="robinson:N-NSF" src="3">a voice</w> <w
> lemma="strong:G1537" morph="robinson:PREP" src="4">from</w> <w
> lemma="strong:G3588 strong:G3772" morph="robinson:T-GPM robinson:N-
> GPM" src="5 6">heaven</w>, <w lemma="strong:G3004" morph="robinson:V-
> PAP-NSF" src="7">saying</w>, <w lemma="strong:G3778"  
> morph="robinson:D-
> NSM" src="8">This</w> <w lemma="strong:G2076" morph="robinson:V-
> PXI-3S" src="9">is</w> <w lemma="strong:G3450" morph="robinson:P-1GS"
> src="12">my</w> <w lemma="strong:G27" morph="robinson:A-NSM"
> src="14">beloved</w> <w lemma="strong:G3588 strong:G5207"
> morph="robinson:T-NSM robinson:N-NSM" src="10 11">Son</w>, <w
> lemma="strong:G1722" morph="robinson:PREP" src="15">in</w> <w
> lemma="strong:G3739" morph="robinson:R-DSM" src="16">whom</w> <w
> lemma="strong:G2106" morph="robinson:V-AAI-1S" src="17">I am well
> pleased</w>.<milestone resp="pdy 2003-12-14-08:48" type="x-
> strongsMarkup"/>="22"꧁
>
>
> Any help would be appreciated.
>
> Thanks!
>
> Working together,
> 	DM Smith



