[sword-devel] Improvements to osis2mod to handle XML comments and <header> correctly

DM Smith dmsmith at crosswire.org
Mon Apr 5 11:18:35 MST 2010

On 04/05/2010 01:44 PM, Weston Ruter wrote:
> DM:
>     But what we really need is not a parser but a tokenizer. I'm
>     thinking about writing one (my degree work was in compiler
>     writing). Basically, we repeat the same tokenization code in
>     several places. It should be trivial to write a complete, accurate
>     one.
> I've also been wanting to work on a tokenizer. At Open Scriptures, the 
> text of a work is currently represented by two models 
> <http://github.com/openscriptures/api/blob/master/models.py> (database 
> tables): Token 
> <http://github.com/openscriptures/api/blob/master/models.py#L242> and 
> Structure 
> <http://github.com/openscriptures/api/blob/master/models.py#L315>. 
> Tokens are the smallest divisible units of text, such as words, 
> punctuation, and whitespace; and structures are the spans of tokens 
> that form logical units, such as verses, paragraphs, quotes, etc. The 
> structures are standoff-markup for the tokens. With the underlying 
> data stored in this way, it can then be serialized in whichever 
> hierarchy desired (book-section-paragraph, book-chapter-verse, 
> all-milestoned, etc) or whichever data format is needed (OSIS, SWORD 
> Module, XHTML, etc.)

This is a much lower level than what an XML tokenizer needs: it would be 
a tokenizer for the text between tags. A single tokenizer that does both 
would be more efficient when both are wanted, but slower when only XML 
tokens are needed.

I think a model could be constructed that could do both and allow one to 
ask for the depth of tokenization that is needed.
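
As a rough illustration of "asking for the depth", here is a minimal 
Python sketch. Everything in it (the names Depth, Token and tokenize, 
and the regular expressions) is hypothetical and made up for this 
example; it is not code from SWORD, osis2mod or Open Scriptures, and it 
ignores CDATA, entities and language-specific word breaking.

```python
import re
from enum import Enum
from typing import Iterator, NamedTuple


class Depth(Enum):
    XML = 1   # tags, comments and whole text runs only
    TEXT = 2  # also split text runs into words, whitespace, punctuation


class Token(NamedTuple):
    kind: str   # "tag", "comment", "text", "word", "space" or "punct"
    value: str


# Comments are tried first so that a '>' inside a comment does not end a tag.
# A stray '<' with no closing '>' matches nothing and is skipped (sketch only).
_XML = re.compile(r"<!--.*?-->|<[^>]*>|[^<]+", re.DOTALL)

# Naive text-level split; real word breaking is language dependent.
_TEXT = re.compile(r"(?P<space>\s+)|(?P<word>\w+)|(?P<punct>[^\w\s])")


def tokenize(source: str, depth: Depth = Depth.XML) -> Iterator[Token]:
    """Yield tokens from source, going only as deep as the caller asks."""
    for match in _XML.finditer(source):
        chunk = match.group()
        if chunk.startswith("<!--"):
            yield Token("comment", chunk)
        elif chunk.startswith("<"):
            yield Token("tag", chunk)
        elif depth is Depth.XML:
            # Caller only wants XML tokens: hand back the text run whole.
            yield Token("text", chunk)
        else:
            for piece in _TEXT.finditer(chunk):
                yield Token(piece.lastgroup, piece.group())


if __name__ == "__main__":
    sample = '<verse osisID="Gen.1.1">In the beginning,</verse>'
    for token in tokenize(sample, Depth.TEXT):
        print(token)
```

With Depth.XML the text between tags comes back as a single "text" 
token, so callers that only need XML tokens do not pay for the word 
splitting.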

There is a big complication with the parsing of text: it is language 
dependent. For example, Thai has words but no word breaks. Basically, 
the task will require a Unicode-aware and somewhat language-aware 
word-break algorithm. The best I've seen is in ICU.
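
To make the problem concrete, here is a small Python sketch. The helper 
naive_words is a hypothetical name invented for this example, and the 
Thai sample is just a phrase of several words written, as usual, without 
spaces. Splitting on whitespace works for the English sample and finds 
no boundaries at all in the Thai one.

```python
from typing import List


def naive_words(text: str) -> List[str]:
    """Split on whitespace only; this assumes words are space-separated."""
    return text.split()


english = "In the beginning"
# A Thai phrase of several words; Thai script puts no spaces between them.
thai = "ในปฐมกาลพระเจ้าทรงสร้างฟ้าสวรรค์และแผ่นดินโลก"

if __name__ == "__main__":
    print(len(naive_words(english)))  # one entry per English word
    print(len(naive_words(thai)))     # the whole phrase is a single "word"
```

A dictionary-based word breaker, such as the one in ICU, is the kind of 
thing needed to find the boundaries inside the Thai string.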

Lucene has a wonderful example in their Jira issues database of how to 
do tokenization. (1488, if I remember.)

> So what I'm currently ruminating on is the process of importing the 
> raw data into the Token and Structure models. I wrote an importer 
> <http://github.com/openscriptures/api/blob/master/importers/Tischendorf-2.5.py> 
> for the Tischendorf GNT data which does everything, both tokenizing and 
> parsing, but obviously there is going to be a lot of code in common 
> with other importers that are written. So I too am thinking about how 
> these importers can be reduced to the bare minimum to handle the 
> unique aspects of the raw data (i.e. normalize it), and then stream 
> the tokens back to a central importer that parses the input and stores 
> it into the Token and Structure models. This central importer facility 
> could be a web service.
> I'd love to collaborate with you on this. We could come up with a 
> common tokenizer that can be used by both SWORD and Open Scriptures. 
> The importer web service could take tokens as input and as output 
> generate a SWORD module and also populate the Open Scriptures models 
> at the same time.
> Thoughts?

Sounds good to me, too.

In Him,

> Weston
> On Mon, Apr 5, 2010 at 10:24 AM, Daniel Owens <dhowens at pmbx.net 
> <mailto:dhowens at pmbx.net>> wrote:
>     Yes, I agree, and if there were a feedback mechanism for the
>     module creator to let them know how to start fixing an OSIS file
>     or conf file, it would save Chris (or whoever else approves
>     modules) time on the basic stuff.
>     Daniel
>     On 4/5/2010 11:09 AM, DM Smith wrote:
>         This is a great idea. Rather than emailing source to modules
>         at crosswire dot org, one could upload it via a web service.
>         We could have stages of validation (xmllint) and construction
>         (osis2mod). Such a service could evaluate the quality of the
>         submission.
>         In Him,
>            DM
>         On 04/05/2010 12:01 PM, Weston Ruter wrote:
>             Why not turn osis2mod into a web service? Then it wouldn't
>             matter how it is implemented since it would be abstracted
>             away by the web service interface. It could use the best
>             XML libraries available today and written in the
>             programming language of choice, both of which would make
>             maintenance and the addition of new features much easier.
>             Weston
> On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <dmsmith at crosswire.org 
> <mailto:dmsmith at crosswire.org>> wrote:
>     On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:
>         On 5 April 2010 13:55, Manfred
>         Bergmann<manfred.bergmann at me.com
>         <mailto:manfred.bergmann at me.com>>  wrote:
>             Hi DM.
>             Am 05.04.2010 um 13:21 schrieb DM Smith:
>                 Regarding using a "real" parser, it is a good idea.
>                 But we don't want SWORD to be dependent on an external
>                 parser.
>             What's the reason for that?
>             I could understand if it would mean for the user to
>             install certain libraries manually but when the sources
>             can be integrated into the project and have the appropriate
>             licence then why not?
>             Manfred
>         IMHO there is no harm in bringing in libxml or a much more
>         lightweight parser like GMarkup. The build system just needs
>         to be adjusted to link e.g. libxml for the osis2mod binary
>         and not the shared sword library.
>         It could even be a new tool, called osisxml2mod for example,
>         and be built optionally so that you can still have a full
>         sword dev environment without libxml.
>         Tools for creating modules do not have to be linked with
>         sword or even live in the sword tarball / svn, although that
>         does help consistent distribution of tools.
>     I don't remember all of Troy's reasoning when I argued for a true
>     parser.
>     From what I recall:
>     o To maintain freedom to re-license SWORD (e.g. for some other
>     Bible society) we need to be able to keep third-party library
>     dependencies well managed. The license needs to be compatible with
>     the GPL but cannot be GPL.
>     o The parser that we have is minimal and simple, sacrificing
>     accuracy and completeness for speed. Regarding accuracy, e.g. the
>     parser allows for spaces around = in attribute declarations.
>     Regarding completeness, e.g. it does not handle namespaces, cdata,
>     dtds/schemas, .... Significantly, it does not require a
>     well-formed document, allowing for fragments. Rather than raising
>     an error, it continues where an XML parser is required to stop.
>     o This parser has better error reporting in that it is based upon
>     knowledge of the input. E.g. it reports the verse having the problem.
>     o By SWORD having the parser, we are not dependent on finding an
>     implementation for every platform (e.g. Windows).
>     There may be other reasons. I'm willing to live with it.
>     But what we really need is not a parser but a tokenizer. I'm
>     thinking about writing one (my degree work was in compiler
>     writing). Basically, we repeat the same tokenization code in
>     several places. It should be trivial to write a complete, accurate
>     one.
>     In His Service,
>        DM
>     _______________________________________________
>     sword-devel mailing list: sword-devel at crosswire.org
>     <mailto:sword-devel at crosswire.org>
>     http://www.crosswire.org/mailman/listinfo/sword-devel
>     Instructions to unsubscribe/change your settings at above page
