[sword-devel] Improvements to osis2mod to handle XML comments and <header> correctly

Weston Ruter westonruter at gmail.com
Tue Apr 6 11:29:29 MST 2010


Hi DM:

> This is a lot lower than what an xml tokenizer needs. This would be a
> tokenizer for the text between tags. Having a single tokenizer that does
> both would be more efficient when both are wanted and slower when only xml
> tokens are needed.
>
> I think a model could be constructed that could do both and allow one to
> ask for the depth of tokenization that is needed.


Good idea. A tokenizer should support atomic granularity, but for many
purposes that is overkill. What would the levels of specificity be?
Verse-level, word-level, and then atomic-level? Atoms would include
whitespace and the punctuation marks that aren't currently marked up with <w>
elements.
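
A tokenizer interface might let the caller choose the depth, along these
lines (a hypothetical Python sketch; none of these names exist in SWORD or
Open Scriptures):

    import re

    # Hypothetical granularity levels a caller could request.
    VERSE, WORD, ATOM = 1, 2, 3

    def tokenize(text, depth=WORD):
        """Yield tokens from verse text at the requested granularity."""
        if depth == VERSE:
            yield text                       # the whole verse as one token
        elif depth == WORD:
            yield from text.split()          # words only
        else:
            # Atoms: words, runs of whitespace, and single punctuation marks.
            yield from re.findall(r"\w+|\s+|[^\w\s]", text)

    print(list(tokenize("In the beginning, God", depth=ATOM)))
    # -> ['In', ' ', 'the', ' ', 'beginning', ',', ' ', 'God']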

There is a big complication with the parsing of text: it is language
> dependent. For example, Thai has words but not word breaks. Basically, the
> task will require a Unicode and somewhat language aware word-break
> algorithm. The best I've seen is in ICU.
>
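
For reference, ICU's word-break algorithm is usable from scripting languages
too; here is a minimal sketch using the PyICU bindings (purely illustrative,
assuming PyICU is installed; this is not SWORD code):

    from icu import BreakIterator, Locale   # pip install PyICU

    def icu_words(text, lang):
        """Yield word tokens found by ICU's language-aware break iterator."""
        bi = BreakIterator.createWordInstance(Locale(lang))
        bi.setText(text)
        start = bi.first()
        for end in bi:                       # iterate over break boundaries
            segment = text[start:end]
            start = end
            if segment.strip():              # skip pure-whitespace segments
                yield segment

    # Thai has no spaces between words; ICU still finds the word boundaries.
    print(list(icu_words("สวัสดีครับ", "th")))  # -> ['สวัสดี', 'ครับ']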

Yes, the text tokenizers would need to be language dependent, but the parser
would not be, correct? While osis2mod tries to provide a single tokenizer
and parser for any text, it currently doesn't support all of OSIS, as has
been discussed, and the amount of logic needed in that single osis2mod
program is apparently becoming overwhelming. It also requires that authors
convert their texts into OSIS if they haven't already. Instead of requiring
authors to convert their raw data formats into OSIS, which then gets
converted into a SWORD Module, what if authors could 'just' write a script
that parses the raw data for tokens and then streams these directly to a
text-independent parser, which can then generate the SWORD Module, etc.?
This standard common parser could be available as a web service or
downloaded as a local library. This would eliminate the need for osis2mod
to account for every possible permutation of an OSIS document, because the
author's tokenizer would normalize the input into a consistent stream of
tokens, e.g. start_verse, start_paragraph, word, punctuation, space,
line_start, end_quote, etc. And there would be a separation of concerns to
make the import process more modular.
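
Concretely, the interface could be as simple as a stream of (type, value)
pairs. A hypothetical sketch of the author-side script (the token names
follow the list above; everything else is made up for illustration):

    # The author's script has one job: turn the raw data into tokens.
    def tokenize_plain_verse(osis_id, text):
        yield ("start_verse", osis_id)
        words = text.split()
        for i, word in enumerate(words):
            yield ("word", word)
            if i < len(words) - 1:
                yield ("space", " ")
        yield ("end_verse", osis_id)

    for token in tokenize_plain_verse("Gen.1.1", "In the beginning"):
        print(token)
    # ('start_verse', 'Gen.1.1'), ('word', 'In'), ('space', ' '), ...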

So to summarize, the idea is to break the text import process into two
steps: tokenizing and parsing. As much import logic as possible would move
into the common standard parser, and the tokenizers would only have to deal
with the unique aspects of each text; there could be a standard library of
tokenizer helpers too. Furthermore, there could be ready-made OSIS
tokenizers that handle the various permutations of OSIS (e.g. BSP, BCV),
which could be customized if the OSIS data isn't normalized enough. The
interface between the tokenizer and the parser would be the token stream
that the tokenizer feeds the parser. Breaking the text import process into
smaller special-purpose scripts that respect the separation of concerns
should make the import task more manageable and reduce the need for a
single monolithic importer.
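
The parser on the other side of that interface would know nothing about the
source format, only the token vocabulary; hypothetically something like:

    # Hypothetical sketch of the common parser. `sink` stands in for
    # whatever actually builds the SWORD module; it is assumed to have
    # a write(verse_id, text) method.
    def parse(tokens, sink):
        verse_id, buffer = None, []
        for kind, value in tokens:
            if kind == "start_verse":
                verse_id, buffer = value, []
            elif kind in ("word", "space", "punctuation"):
                buffer.append(value)
            elif kind == "end_verse":
                sink.write(verse_id, "".join(buffer))
            # start_paragraph, line_start, end_quote, etc. would be
            # handled here too, mapping onto the module's structure.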

Thoughts?

Weston

On Mon, Apr 5, 2010 at 11:18 AM, DM Smith <dmsmith at crosswire.org> wrote:

>  On 04/05/2010 01:44 PM, Weston Ruter wrote:
>
> DM:
>
> But what we really need is not a parser but a tokenizer. I'm thinking about
>> writing one (my degree work was in compiler writing). Basically, we repeat
>> the same tokenization code in several places. It should be trivial to write
>> a complete, accurate one.
>>
>
> I've also been wanting to work on a tokenizer. At Open Scriptures, the text
> of a work is currently represented by two models (database tables)
> <http://github.com/openscriptures/api/blob/master/models.py>:
> Token <http://github.com/openscriptures/api/blob/master/models.py#L242> and
> Structure <http://github.com/openscriptures/api/blob/master/models.py#L315>.
> Tokens are the smallest divisible units of text, such as words, punctuation,
> and whitespace; and structures are the spans of tokens that form logical
> units, such as verses, paragraphs, quotes, etc. The structures are
> standoff-markup for the tokens. With the underlying data stored in this way,
> it can then be serialized in whichever hierarchy desired
> (book-section-paragraph, book-chapter-verse, all-milestoned, etc) or
> whichever data format is needed (OSIS, SWORD Module, XHTML, etc.)
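>
> (Roughly, and glossing over most fields, the shape is something like the
> following Django-style sketch; the names here are illustrative, not the
> actual definitions, which live at the models.py links above:)
>
>     from django.db import models
>
>     class Token(models.Model):
>         # The smallest divisible unit of text: a word, a punctuation
>         # mark, or a run of whitespace.
>         data = models.CharField(max_length=255)
>         type = models.CharField(max_length=12)      # 'word', 'punctuation', ...
>         position = models.PositiveIntegerField()    # order within the work
>
>     class Structure(models.Model):
>         # Standoff markup: a logical span (verse, paragraph, quote, ...)
>         # identified by the tokens it starts and ends with.
>         type = models.CharField(max_length=12)
>         start_token = models.ForeignKey(Token, on_delete=models.CASCADE,
>                                         related_name='starts')
>         end_token = models.ForeignKey(Token, on_delete=models.CASCADE,
>                                       related_name='ends')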
>
>
> This is a lot lower than what an xml tokenizer needs. This would be a
> tokenizer for the text between tags. Having a single tokenizer that does
> both would be more efficient when both are wanted and slower when only xml
> tokens are needed.
>
> I think a model could be constructed that could do both and allow one to
> ask for the depth of tokenization that is needed.
>
> There is a big complication with the parsing of text: it is language
> dependent. For example, Thai has words but not word breaks. Basically, the
> task will require a Unicode and somewhat language aware word-break
> algorithm. The best I've seen is in ICU.
>
> Lucene has a wonderful example in their Jira issues database of how to do
> tokenization. (1488, if I remember.)
>
>
> So what I'm currently ruminating on is the process of importing the raw
> data into the Token and Structure models. I wrote an importer
> <http://github.com/openscriptures/api/blob/master/importers/Tischendorf-2.5.py>
> for the Tischendorf GNT data which does everything, both tokenizing and
> parsing, but obviously there is going to be a lot of code in common with
> other importers that are written. So I too am thinking about how these
> importers can be reduced to the bare minimum to handle the unique aspects of
> the raw data (i.e. normalize it), and then stream the tokens back to a
> central importer that parses the input and stores it into the Token and
> Structure models. This central importer facility could be a web service.
>
> I'd love to collaborate with you on this. We could come up with a common
> tokenizer that can be used by both SWORD and Open Scriptures. The importer
> web service could take tokens as input and as output generate a SWORD module
> and also populate the Open Scriptures models at the same time.
>
> Thoughts?
>
>
> Sounds good to me, too.
>
> In Him,
>     DM
>
>
>
> Weston
>
>
>
> On Mon, Apr 5, 2010 at 10:24 AM, Daniel Owens <dhowens at pmbx.net> wrote:
>
>> Yes, I agree, and if there were a feedback mechanism for the module
>> creator to let them know how to start fixing an OSIS file or conf file, it
>> would save Chris (or whoever else approves modules) time on the basic stuff.
>>
>> Daniel
>>
>>
>> On 4/5/2010 11:09 AM, DM Smith wrote:
>>
>>> This is a great idea. Rather than emailing source to modules at crosswire
>>> dot org, one could upload it via a web service. We could have stages of
>>> validation (xmllint) and construction (osis2mod). Such a service could
>>> evaluate the quality of the submission.
>>>
>>> In Him,
>>>    DM
>>>
>>> On 04/05/2010 12:01 PM, Weston Ruter wrote:
>>>
>>>> Why not turn osis2mod into a web service? Then it wouldn't matter how it
>>>> is implemented, since it would be abstracted away by the web service
>>>> interface. It could use the best XML libraries available today and be
>>>> written in the programming language of choice, both of which would make
>>>> maintenance and the addition of new features much easier.
>>>>
>>>> Weston
>>>>
>>>
>
>
> On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <dmsmith at crosswire.org> wrote:
>
>> On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:
>>
>>> On 5 April 2010 13:55, Manfred Bergmann <manfred.bergmann at me.com> wrote:
>>>
>>>
>>>> Hi DM.
>>>>
>>>> Am 05.04.2010 um 13:21 schrieb DM Smith:
>>>>
>>>>
>>>>
>>>>> Regarding using a "real" parser, it is a good idea. But we don't want
>>>>> SWORD to be dependent on an external parser.
>>>>>
>>>>>
>>>> What's the reason for that?
>>>> I could understand it if it meant that users had to install certain
>>>> libraries manually, but when the sources can be integrated into the
>>>> project and have the appropriate licence, then why not?
>>>>
>>>>
>>>> Manfred
>>>>
>>>>
>>>>
>>> IMHO there is no harm in bringing in libxml or a much more lightweight
>>> parser like GMarkup. The build system just needs to be adjusted to
>>> link e.g. libxml into the osis2mod binary and not into the shared sword
>>> library. It could even be a new tool, osisxml2mod for example, built
>>> optionally so that you can still have a full sword dev environment
>>> without libxml.
>>>
>>> Tools for creating modules do not have to be linked with sword or even
>>> live in the sword tarball / svn, although it does help consistent
>>> distribution of tools.
>>>
>>>
>>  I don't remember all of Troy's reasoning when I argued for a true parser.
>>
>> From what I recall:
>> o To maintain freedom to re-license SWORD (e.g. for some other Bible
>> society), we need to be able to keep third-party library dependencies well
>> managed. The license needs to be compatible with the GPL but cannot be GPL.
>>
>> o The parser that we have is minimal and simple, sacrificing accuracy and
>> completeness for speed. Regarding accuracy, e.g. the parser allows for
>> spaces around = in attribute declarations. Regarding completeness, e.g. it
>> does not handle namespaces, CDATA, DTDs/schemas, .... Significantly, it does
>> not require a well-formed document, allowing for fragments. Rather than
>> raising an error, it continues where an XML parser is required to stop.
>>
>> o This parser has better error reporting, in that it is based upon
>> knowledge of the input; e.g. it reports the verse that has the problem.
>>
>> o By SWORD having the parser, we are not dependent on finding an
>> implementation for every platform (e.g. Windows).
>>
>> There may be other reasons. I'm willing to live with it.
>>
>> But what we really need is not a parser but a tokenizer. I'm thinking
>> about writing one (my degree work was in compiler writing). Basically, we
>> repeat the same tokenization code in several places. It should be trivial to
>> write a complete, accurate one.
>>
>> In His Service,
>>     DM
>>
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
>>
>
>
>