[sword-devel] Likely missing Arabic vowel point filter.

Fri Dec 16 01:28:51 MST 2011

On 12/15/2011 10:12 PM, Peter von Kaehne wrote:
> On Fri, 2011-12-16 at 00:50 -0500, Paul A. Martel wrote:
>> Unless I'm reading this wrong, this clause is impossible to satisfy.
>> So it effectively _disables_ the intended filtering of the 0xFC code
>> page characters.
>
> Thanks Paul. I had been staring at this piece of code for a long time
> and could not figure out why it does not work. It was me who wrote it, copying
> another filter.

I took a look at this code (and the Hebrew correlate to make sure it 
didn't have the same problem). The UTF8ArabicPoints filter isn't doing 
what is intended, even if the issue Paul points out is corrected.

It's important to note that these are UTF-8 filters only. So they only 
work with UTF-8 encoded text, and they operate directly on the UTF-8 
bytestream, not Unicode codepoints. The existing UTF8ArabicPoints filter 
assumes it is operating on codepoint values, which would work for UTF-16 
text, but not UTF-8. The byte sequences it looks for aren't even legal as 
UTF-8 characters, so the best this could achieve is to output illegal UTF-8.
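
To make the codepoint-versus-byte distinction concrete, here is a minimal Python sketch. It is an illustration only, not SWORD code, and the variable names are made up for the example. It shows that a codepoint such as U+064B never appears as a single value in the UTF-8 bytestream, while it does appear directly in UTF-16.

```python
# U+064B (ARABIC FATHATAN), the first codepoint in the range the filter targets.
codepoint = 0x064B
encoded = chr(codepoint).encode("utf-8")

# In UTF-8 the codepoint becomes two bytes, neither of which equals 0x064B.
print([hex(b) for b in encoded])  # ['0xd9', '0x8b']

# A filter that compares individual bytes against codepoint values never matches.
print(any(b == codepoint for b in encoded))  # False

# In UTF-16 (big-endian) the code unit carries the codepoint value directly.
print(chr(codepoint).encode("utf-16-be").hex())  # 064b
```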

As an example, the filter is supposed to remove codepoints 
0x064B-0x0655, which would be UTF-8 0xD9 0x8B through 0xD9 0x95.
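
A byte-level filter for that range could look roughly like the following Python sketch. This is not the actual UTF8ArabicPoints implementation; the function name `strip_arabic_points` is hypothetical, and the sketch assumes the input is valid UTF-8. Because 0xD9 can only occur as the lead byte of a two-byte sequence in valid UTF-8, checking the byte that follows it is enough.

```python
def strip_arabic_points(data: bytes) -> bytes:
    """Drop the UTF-8 sequences 0xD9 0x8B .. 0xD9 0x95 (U+064B..U+0655).

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # 0xD9 is a two-byte lead byte; skip it together with its
        # continuation byte when that byte falls in the targeted range.
        if b == 0xD9 and i + 1 < n and 0x8B <= data[i + 1] <= 0x95:
            i += 2
            continue
        out.append(b)
        i += 1
    return bytes(out)


# BEH + KASRA, SEEN + SUKUN, MEEM + KASRA
pointed = "\u0628\u0650\u0633\u0652\u0645\u0650".encode("utf-8")
plain = strip_arabic_points(pointed).decode("utf-8")
print(plain == "\u0628\u0633\u0645")  # True
```

Other bytes, including other sequences that start with 0xD9 (for example the Arabic letters in U+0640..U+064A), pass through unchanged.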

I don't actually know the proper way to encode Arabic, from the 
perspective of composition. Even the basic Arabic block has quite a few 
forms that look like they are composed with vowels. Apparently I should 
do some homework on this subject, but if there are consonant+vowel 
composed characters that would be used for Arabic, we should address 
those in this filter also, by transforming them to their consonant bases.
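
One possible way to explore this, sketched in Python rather than in the filter itself, is to apply Unicode canonical decomposition (NFD) before removing the marks. The function name below is hypothetical. The example relies on the Unicode character database, in which a few letters in the basic Arabic block, such as U+0622 and U+0623, decompose into a base letter plus a combining mark from the 0x0653-0x0655 range. Those marks are madda and hamza rather than vowels, so whether stripping them is appropriate for Arabic text remains an open question.

```python
import unicodedata


def strip_points_after_decomposition(text: str) -> str:
    """Decompose to NFD, then drop codepoints U+064B..U+0655.

    Hypothetical helper for illustration; works on decoded text,
    so a UTF-8 bytestream would need decoding first.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not 0x064B <= ord(ch) <= 0x0655)


# U+0622 (ALEF WITH MADDA ABOVE) decomposes to U+0627 + U+0653.
print(unicodedata.decomposition("\u0622"))  # 0627 0653
print(strip_points_after_decomposition("\u0622") == "\u0627")  # True
```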

--Chris