[sword-devel] usfm2osis.py

Greg Hellings greg.hellings at gmail.com
Sun Aug 5 17:28:13 MST 2012


On Sun, Aug 5, 2012 at 7:19 PM, Chris Little <chrislit at crosswire.org> wrote:
>
>
> On Aug 5, 2012, at 11:37 AM, David Haslam <dfhmch at googlemail.com> wrote:
>
>> FWIW, I just came across this Python Regular Expression Testing Tool:
>> http://www.pythonregex.com/
>>
>> Does Python support the full 21-bit Unicode range?
>>
>> cf. Many other regular expression engines only support the Basic
>> Multilingual Plane.
>>
>
> Yes, Python regex supports non-BMP characters. The language tags are Plane 14, I believe. An engine that supports only the BMP can't be said to support Unicode and is probably just processing bytes.
>

As further explanation, Python (2.x) distinguishes between the "str"
(string) object, which is a sequence of 8-bit bytes in whatever
encoding you select, and the "unicode" object, which is a string of
Unicode characters. The exact internal representation probably differs
between CPython and Jython. CPython used to use UCS-2 only, but it can
now be built with either UCS-2 or UCS-4 storage, since Unicode has been
extended beyond the BMP.

To read more details see
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
under the heading "Internal Representation".
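
As a rough illustration of the point about non-BMP characters, here is
a minimal sketch of my own. It assumes Python 3.3 or later, where every
str is Unicode and there is no narrow/wide build distinction; on an old
narrow (UCS-2) build the character class would not behave this way. The
names TAG_PATTERN and strip_tags are hypothetical, not anything from
usfm2osis.py.

```python
import re
import sys

# U+E0001 LANGUAGE TAG lives in Plane 14, well outside the BMP.
LANGUAGE_TAG = "\U000E0001"

# Character class covering the Plane 14 "Tags" block, U+E0000..U+E007F.
# (Hypothetical name, used only for this illustration.)
TAG_PATTERN = re.compile("[\U000E0000-\U000E007F]")


def strip_tags(text):
    """Return text with every Plane 14 tag character removed."""
    return TAG_PATTERN.sub("", text)


# sys.maxunicode is 0x10FFFF when the interpreter can hold any code
# point in a single character, and 0xFFFF on an old narrow build.
is_wide = sys.maxunicode > 0xFFFF

sample = "a" + LANGUAGE_TAG + "b"
cleaned = strip_tags(sample)
```

If len(sample) is 3 rather than 4, the interpreter is storing the
Plane 14 character as one code point and not as a surrogate pair, which
is what lets the regex range above match it as a single character.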

--Greg

> --Chris