[sword-devel] Hyphens in book names
dmsmith at crosswire.org
Wed Sep 29 19:13:08 MST 2010
I've read the thread and I'd like to add my thoughts:
I don't think the discussion regarding whether - is a letter is constructive. We have a problem to solve. Right now - is a meta-character indicating a range.
I think we should extend the book name parser to work with Bible book names as they occur in other languages and as they may be input into our front-ends. This includes -, non-Arabic digits (I think "Arabic" is what 0-9 are called?), and characters like ' that, if I understand correctly, represent things like clicks, whistles, or glottal stops.
In the case of JSword, it is a tough problem. We split the input into a token stream. The splitting is relatively naive and does split on -.
I've thought about how I'd fix it and I have not found a good solution. One edge case that is allowed is Gen-Exo, which means everything from the beginning of Genesis to the end of Exodus.
My thought is to take the book names (also abbreviations and alternates) for English, the user's locale and the language of the module and build a trie. Then a given input is analyzed against the trie for the longest matching prefix. As long as the next char is found in the trie we keep going. If the next "char" is not in the trie and is a letter then we have an error. If it is not a letter we take the match and using the trie find all the matches with that prefix. Disambiguation is handled in the usual way.
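A minimal sketch of that trie lookup, written in Python for brevity rather than in the Java or C++ of the actual projects. The class name `BookNameTrie`, the method `match`, and the sample book list are all hypothetical and only illustrate the idea. Disambiguation is left out, and matching is made case-insensitive by lowercasing each character, which is an assumption not stated above.

```python
_END = None  # sentinel key marking "a complete name ends here"


class BookNameTrie:
    def __init__(self, names):
        self.root = {}
        for name in names:
            node = self.root
            for ch in name:
                # Lowercase per character so offsets stay aligned with the input.
                node = node.setdefault(ch.lower(), {})
            node[_END] = name

    def match(self, text):
        """Return (chars_consumed, candidate_names) for the prefix of text."""
        node = self.root
        i = 0
        # Keep going as long as the next char is found in the trie.
        while i < len(text) and text[i].lower() in node:
            node = node[text[i].lower()]
            i += 1
        if i == 0:
            raise ValueError("no book name at start of input: %r" % text)
        # A letter that is not in the trie is an error.
        if i < len(text) and text[i].isalpha():
            raise ValueError("unknown book name near: %r" % text[: i + 1])
        # Otherwise take the match and collect every name with that prefix.
        return i, sorted(self._collect(node))

    @staticmethod
    def _collect(node):
        names, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key is _END:
                    names.append(child)
                else:
                    stack.append(child)
        return names


# Illustrative names only; a real table would come from the locale files.
trie = BookNameTrie(["Gen", "Genesis", "Exo", "Exodus", "Mga Taga-Roma"])
print(trie.match("Gen-Exo"))            # the hyphen ends the match: a range
print(trie.match("Mga Taga-Roma 1:1"))  # the hyphen is part of the name
```

With this approach the hyphen is only treated as a range separator when the trie has no way to continue through it, so Gen-Exo and a hyphenated book name can coexist.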
In SWORD, it'd be easy to knit this kind of recognizer into the parser.
As to numbers, I'd suggest using an ICU number shaper to map all numeric values in an input into 0-9. We do this in JSword for Arabic and Farsi and it works quite well.
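To illustrate the digit mapping, here is a small sketch that uses Python's standard `unicodedata` module instead of ICU, so it is a stand-in for the number shaper and not the JSword code. The function name `to_ascii_digits` is made up for this example.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit in text with its 0-9 equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


# Arabic-Indic digits (U+0660..U+0669) and the Farsi forms (U+06F0..U+06F9)
print(to_ascii_digits("\u0663:\u0661\u0666"))
print(to_ascii_digits("\u06f3:\u06f1\u06f6"))
```

Running this step before tokenizing means the rest of the parser only ever has to deal with 0-9.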
Regarding OSIS, it is a fixed dictionary of internal names for all books. They are not meant to be shown to users, even though many would have no problem understanding them.
Also regarding OSIS, we currently subject osisRefs and osisIDs to the same parser as user input. I think there should be a separate parser, which would be very simple, that would parse them directly into our internal form. If the reference comes out of an OSIS encoded Bible, then we could have a great gain.
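A rough sketch of how simple such a dedicated parser could be, again in Python. It assumes the plain Book.Chapter.Verse form with an optional - range, and it ignores work prefixes, grains, and sub-identifiers. The `Verse` tuple stands in for the internal form and `OSIS_BOOKS` is only a small illustrative subset; both are hypothetical.

```python
from typing import NamedTuple, Optional

# Illustrative subset of the fixed OSIS book dictionary.
OSIS_BOOKS = {"Gen", "Exod", "Matt", "John", "Rom"}


class Verse(NamedTuple):
    book: str
    chapter: Optional[int] = None
    verse: Optional[int] = None


def parse_osis_id(text):
    """Parse 'Gen', 'Gen.1' or 'Gen.1.1' into a Verse."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 3 or parts[0] not in OSIS_BOOKS:
        raise ValueError("not a recognised OSIS ID: %r" % text)
    if not all(p.isdigit() for p in parts[1:]):
        raise ValueError("chapter and verse must be numbers: %r" % text)
    return Verse(parts[0], *(int(p) for p in parts[1:]))


def parse_osis_ref(text):
    """Parse a single ID or an 'ID-ID' range into a (first, last) pair."""
    start, sep, end = text.partition("-")
    first = parse_osis_id(start)
    last = parse_osis_id(end) if sep else first
    return first, last


print(parse_osis_ref("Gen.1.1-Gen.1.5"))
```

Because the book names come from a fixed dictionary, no locale lookup or disambiguation is needed, and here the hyphen can only ever mean a range.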
Here's the rub: someone has to step up and tackle it. The code for SWORD is all tucked into a single method. In JSword, it is spread out into a finite state automaton that is hard to change. It will just have to be replaced.
On Sep 29, 2010, at 4:55 PM, Robert Hunt wrote:
> New Zealand.
> Hello all,
> I am spending today studying the documentation on the Crosswire Sword wiki so I'm likely to have a few questions. Please let me know if this is not the right forum to ask questions.
> I see in http://www.crosswire.org/wiki/DevTools:SWORD that localised book names are not allowed hyphens in them (because the hyphen is used for verse ranges). In the Philippine language that we worked with as Bible translators, the hyphen is a letter in the alphabet and appears in several book names!
> Is this still a current limitation? If so, what is the suggested work-around?