[sword-devel] API additions

Thu, 8 Nov 2001 06:10:47 -0800

I've added a large number of new classes over the last few days and made
a few other minor adjustments.

ThML option filters were added (ThMLFootnotes, ThMLStrongs,
ThMLHeadings, ThMLMorph, ThMLLemma, ThMLScripref).  They act just like
the GBF counterparts but work with ThML.

SWModule and all of its descendants now take a language value passed to
their constructors.  You can call the Lang() method to retrieve the
value.  We needed this for BibleCS because WinNT & Win9x handle
right-to-left texts differently depending on the language, but there are
other good uses for language information, like sorting/filtering by
language.  The SWModule constructor is getting quite large, and I'm
ready to suggest we start passing a module info struct instead of
separate arguments so that new information can be added to the module
less painfully.  (But we should retain the current constructor for
backwards compatibility.)

The other four classes I added all require ICU and are all SWFilter
descendants:

UTF8NFC normalizes according to Normalization Form C (NFC), which
should turn text into its most composed form.  In other words, combining
accents will compose with the letters they follow: an "a" followed by a
combining "umlaut" will turn into a single "a-umlaut" character.  Since
this is how all our texts should be distributed anyway, it may not be
that useful to anyone.
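
As a rough illustration of the composition described above, here is a
minimal sketch using Python's standard unicodedata module.  It is not
the ICU-based UTF8NFC filter itself, and the helper name to_nfc is
made up for this example.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (canonical composition).

    Hypothetical helper for illustration; not part of the SWORD API.
    """
    return unicodedata.normalize("NFC", text)


# "a" followed by U+0308 COMBINING DIAERESIS (the "umlaut").
decomposed = "a\u0308"
composed = to_nfc(decomposed)

# The two code points collapse into the single precomposed U+00E4.
print(len(decomposed), len(composed))
```

Text that is already in NFC passes through unchanged, which matches the
remark that the filter does little for properly prepared texts.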

UTF8NFKD normalizes according to Normalization Form KD (NFKD), which is
compatibility decomposition.  That means an "a-umlaut" turns into an "a"
followed by a combining "umlaut".  This filter should be used as a strip
filter when performing searches because searches are best performed on
strings in NFD or NFKD.
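
To show why decomposed text is convenient for searching, here is a small
sketch, again using Python's unicodedata rather than the ICU-based
filter.  The function names search_key and contains are assumptions for
this example only.

```python
import unicodedata


def search_key(text: str) -> str:
    """Return text in NFKD so equivalent spellings compare equal.

    Hypothetical helper; it only mimics what a strip filter might do.
    """
    return unicodedata.normalize("NFKD", text)


def contains(haystack: str, needle: str) -> bool:
    """Substring search performed on the NFKD form of both strings."""
    return search_key(needle) in search_key(haystack)


precomposed = "M\u00e4rchen"   # uses U+00E4, a single "a-umlaut"
combining = "Ma\u0308rchen"    # uses "a" + U+0308 COMBINING DIAERESIS

# A raw comparison fails, but the NFKD keys are identical.
print(precomposed == combining)
print(search_key(precomposed) == search_key(combining))
```

The "K" (compatibility) part also folds things like the "fi" ligature
U+FB01 into plain "fi", so a query typed on an ordinary keyboard can
still match.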

UTF8BiDiReordering will reorder text from logical order into visual
order.  So passing it Hebrew, Arabic, or Syriac should return a reversed
string.  Passing it English should return the same string.  (And it
should be able to handle any reasonable mix of scripts/directionalities.)
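
The two simple cases above can be sketched as follows.  This is a
deliberately naive illustration, not the Unicode Bidirectional
Algorithm that ICU implements: it handles only strings whose strong
characters all share one direction, and it refuses mixed text.  It also
reverses code point by code point, so combining marks would end up on
the wrong side of their base letters.  The function name is
hypothetical.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and
# "AL" (Arabic letters, which also covers Syriac).
RTL_CLASSES = {"R", "AL"}


def naive_visual_order(text: str) -> str:
    """Reverse purely right-to-left text; return left-to-right text as is.

    Hypothetical helper for illustration only. Mixed-direction text
    needs the full bidirectional algorithm and is rejected here.
    """
    classes = {unicodedata.bidirectional(ch) for ch in text}
    has_rtl = bool(classes & RTL_CLASSES)
    has_ltr = "L" in classes
    if has_rtl and has_ltr:
        raise NotImplementedError("mixed directions need the real algorithm")
    return text[::-1] if has_rtl else text


hebrew = "\u05e9\u05dc\u05d5\u05dd"   # shin, lamed, vav, final mem
print(naive_visual_order("peace") == "peace")
print(naive_visual_order(hebrew) == hebrew[::-1])
```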

UTF8arShaping will perform Arabic shaping on a string.  Arabic text is
encoded with an abstract character, which is usually represented by the
isolated form glyph in fonts.  This class will convert the abstract
character codepoint to a codepoint in the Arabic presentation forms area
corresponding to the initial, medial, final, or isolated form of the
glyph, depending on its position in the word.
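
The position-dependent mapping described above can be sketched roughly
like this.  It is not how UTF8arShaping or ICU does it: instead of real
joining tables it guesses a letter's joining behaviour from whether
Unicode defines a named "INITIAL FORM" or "FINAL FORM" presentation
character for it, and it ignores vowel marks and lam-alef ligatures.
All function names are hypothetical.

```python
import unicodedata


def _form(ch, form):
    """Return the presentation-form character for ch, or None."""
    name = unicodedata.name(ch, "")
    if not name.startswith("ARABIC LETTER "):
        return None
    try:
        return unicodedata.lookup(f"{name} {form} FORM")
    except KeyError:
        return None


def joins_next(ch):
    # Assumption: a letter with an initial form connects to what follows.
    return _form(ch, "INITIAL") is not None


def joins_prev(ch):
    # Assumption: a letter with a final form connects to what precedes.
    return _form(ch, "FINAL") is not None


def shape(text):
    """Replace abstract Arabic letters with positional presentation forms."""
    out = []
    for i, ch in enumerate(text):
        prev_ok = i > 0 and joins_next(text[i - 1]) and joins_prev(ch)
        next_ok = (i + 1 < len(text) and joins_next(ch)
                   and joins_prev(text[i + 1]))
        if prev_ok and next_ok:
            form = "MEDIAL"
        elif prev_ok:
            form = "FINAL"
        elif next_ok:
            form = "INITIAL"
        else:
            form = "ISOLATED"
        # Leave the character alone if no such presentation form exists.
        out.append(_form(ch, form) or ch)
    return "".join(out)


# BEH, ALEF, BEH: initial BEH, final ALEF, then an isolated BEH,
# because ALEF does not connect to the letter after it.
shaped = shape("\u0628\u0627\u0628")
print([unicodedata.name(c) for c in shaped])
```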

--Chris