[sword-devel] Likely missing Arabic vowel point filter.
chrislit at crosswire.org
Fri Dec 16 01:28:51 MST 2011
On 12/15/2011 10:12 PM, Peter von Kaehne wrote:
> On Fri, 2011-12-16 at 00:50 -0500, Paul A. Martel wrote:
>> Unless I'm reading this wrong, this clause is impossible to satisfy.
>> So it effectively _disables_ the intended filtering of the 0xFC code
>> page characters.
> Thanks Paul. I had been staring at this piece of code for a long time
> and could not figure out why it does not work. It was me who wrote it, copying
> another filter.
I took a look at this code (and the Hebrew correlate to make sure it
didn't have the same problem). The UTF8ArabicPoints filter isn't doing
what is intended, even if the issue Paul points out is corrected.
It's important to note that these are UTF-8 filters only. So they only
work with UTF-8 encoded text, and they operate directly on the UTF-8
bytestream, not Unicode codepoints. The existing UTF8ArabicPoints filter
assumes it is operating on codepoint values, which would work for UTF-16
text, but not UTF-8. The byte sequences it looks for aren't even legal as
UTF-8 characters, so the best this could achieve is to output illegal UTF-8.
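To make the codepoint-versus-bytes distinction concrete, here is a small sketch in Python (not the SWORD C++ code). It compares the codepoint value of U+064B with its UTF-8 and UTF-16 encodings, and checks that a byte sequence starting with 0xFC does not decode as UTF-8. The helper name is_valid_utf8 is made up for this illustration, and the 0x5E trailing byte is an arbitrary choice, since the exact bytes the filter tests are not quoted in this thread.

```python
# Sketch only: Python standing in for the byte-level view a UTF-8 filter has.

cp = 0x064B                      # ARABIC FATHATAN, first codepoint in the range
ch = chr(cp)

utf8 = ch.encode("utf-8")        # two bytes: 0xD9 0x8B
utf16 = ch.encode("utf-16-be")   # two bytes: 0x06 0x4B, same digits as the codepoint


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xFC is not a legal lead byte in UTF-8 as currently defined, so a filter
# that matches on it can never match well-formed input.
bogus = bytes([0xFC, 0x5E])      # 0x5E is an arbitrary second byte
print(utf8.hex(), utf16.hex(), is_valid_utf8(utf8), is_valid_utf8(bogus))
```

The UTF-16 bytes mirror the codepoint value, which is why a comparison against codepoint values happens to look right for UTF-16 but not for UTF-8.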
As an example, the filter is supposed to remove codepoints
0x064B-0x0655, which would be UTF-8 0xD9 0x8B through 0xD9 0x95.
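A byte-level filter for that range could look roughly like the following. This is a Python sketch under my own assumptions, not the UTF8ArabicPoints implementation; the function name strip_arabic_points is hypothetical. It relies on the fact that in well-formed UTF-8 the byte 0xD9 can only appear as the lead byte of a two-byte sequence, so scanning byte by byte is safe.

```python
def strip_arabic_points(data: bytes) -> bytes:
    """Hypothetical sketch: drop U+064B..U+0655 from a UTF-8 byte string.

    Those codepoints encode as lead byte 0xD9 followed by a trail byte
    in 0x8B..0x95. All other bytes are copied through unchanged.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        if data[i] == 0xD9 and i + 1 < n and 0x8B <= data[i + 1] <= 0x95:
            i += 2               # skip both bytes of the point
            continue
        out.append(data[i])
        i += 1
    return bytes(out)


# Usage: beh + kasra, seen + sukun, meem + kasra
pointed = "\u0628\u0650\u0633\u0652\u0645\u0650".encode("utf-8")
bare = strip_arabic_points(pointed).decode("utf-8")
```

Other Arabic letters that share the 0xD9 lead byte (trail bytes outside 0x8B..0x95) pass through untouched, as does any non-Arabic text.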
I don't actually know the proper way to encode Arabic, from the
perspective of composition. Even the basic Arabic block has quite a few
forms that look like they are composed with vowels. Apparently I should
do some homework on this subject, but if there are consonant+vowel
composed characters that would be used for Arabic, we should address
those in this filter also, by transforming them to their consonant bases.
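One possible mechanism for the precomposed case, offered only as a sketch: Unicode canonical decomposition (NFD) splits some letters in the basic Arabic block into a base letter plus a combining mark that falls inside 0x064B-0x0655, after which the mark can be dropped like any other. The example below uses Python's standard unicodedata module; the function name to_consonant_base is hypothetical. Note that the letters shown carry hamza or madda rather than vowels, so whether stripping them is desirable is a question for someone who knows Arabic orthography.

```python
import unicodedata


def to_consonant_base(text: str) -> str:
    """Hypothetical sketch: decompose, then drop marks in U+064B..U+0655."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not 0x064B <= ord(c) <= 0x0655)


# U+0622 ALEF WITH MADDA ABOVE decomposes to U+0627 ALEF + U+0653 MADDAH ABOVE.
parts = unicodedata.normalize("NFD", "\u0622")
base = to_consonant_base("\u0622")
```

This works on decoded text rather than on the UTF-8 bytestream, so in the existing filter design it would need a decode step or an equivalent byte-level table.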