[sword-devel] StripText() result not converted to UTF-8?

Troy A. Griffitts scribe at crosswire.org
Sun Feb 18 12:48:22 MST 2007


> I guess cp1252 should remain the default output of the thml2plain
> filter, shouldn't it?

At first thought, I don't think so.  Consider a Chinese ThML module.  If
its encoding were BIG5 or something else, the module's author could set
Encoding=BIG5 in the .conf file.  The text SHOULD then be converted to
UTF-8 internally, and from there to the output encoding chosen by the
user (programmer).  I realize this isn't how things work right now, but
on initial consideration it seems like the most logical flow of events.
The markup filters should not have to be concerned with encoding, and if
they have to pick one, it seems it should be our chosen internal
encoding, UTF-8.  What do you think?
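
The flow described above can be sketched roughly as follows. This is a
Python illustration, not SWORD code: the function name transcode and its
parameters are made up for the example, and Python's str type stands in
for the internal UTF-8 representation.

```python
def transcode(raw: bytes, module_encoding: str,
              output_encoding: str = "utf-8") -> bytes:
    """Convert module text to the encoding the caller asked for."""
    # Step 1: decode using the encoding declared for the module
    # (the value that Encoding= in the .conf file would carry).
    text = raw.decode(module_encoding)
    # Step 2: encode the internal Unicode text to the output encoding.
    return text.encode(output_encoding)


# A BIG5-encoded module entry, delivered to the caller as UTF-8.
big5_entry = "聖經".encode("big5")
utf8_entry = transcode(big5_entry, "big5")
print(utf8_entry.decode("utf-8"))
```

With this split, nothing between the two steps needs to know which
encoding the module started in or which one the caller wants.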


> Joachim
>> Joachim,
>> 	I believe the filter is wrong.  It should return the UTF-8 value.  This
>> is a bug.  Anyone want to look through the unicode code chart and recode
>> all these values?
>> http://www.unicode.org/charts/PDF/U0080.pdf
>> 	Sorry for the bug Joachim.
>> 		-Troy.
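
A minimal sketch of the recoding asked for above, in Python rather than
in the filter's own code. It assumes the ThML entity names match the HTML
names in Python's html.entities table; the function name entity_to_utf8
is hypothetical.

```python
from html.entities import name2codepoint


def entity_to_utf8(name: str) -> bytes:
    """Return the UTF-8 bytes for a named entity such as 'AElig'."""
    # Look up the Unicode code point, then encode it as UTF-8 instead of
    # emitting the code point as a single Latin-1/cp1252 byte.
    return chr(name2codepoint[name]).encode("utf-8")


print(entity_to_utf8("AElig"))  # b'\xc3\x86', the two-byte UTF-8 form of Æ
```

A lone byte in the 0x80-0xFF range is not valid UTF-8 on its own, which
is why the single-byte output breaks callers that expect UTF-8.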
>> Joachim Ansorg wrote:
>>> Hi,
>>> replying to myself.
>>> I've been wrong in some of my assumptions.
>>> JFB is ThML. It contains the entity &AElig; (Æ).
>>> StripText() calls the filter ThMLPlain, which converts the &AElig; into
>>> 0xC6, which is the corresponding cp1252 character code.
>>> I thought that StripText() would remove all markup and return text in the
>>> encoding given to EncodingFilterMgr.
>>> My question:
>>> Is that right or wrong?
>>> Some help would be wonderful,
>>> Joachim
>>>> Hi,
>>>> I'm just debugging a bug in BibleTime.
>>>> Our SWMgr is created to output utf8.
>>>> The module JFB contains the entity &AElig; (Æ).
>>>> When I call StripText() the entity is converted to the corresponding
>>>> character in the cp1252 charset, i.e. a char with the value 0xC6.
>>>> I thought that the latin2utf8 filter would convert this plain text to
>>>> utf8 because I told SWMgr to do this for me.
>>>> Is there a way to set the output encoding for StripText() to be
>>>> different than the module's encoding?
>>>> Thanks a lot,
>>>> Joachim
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
