<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html; charset=ISO-8859-1"

 http-equiv="Content-Type">

</head>

<body text="#000000" bgcolor="#ffffff">

On 04/05/2010 01:44 PM, Weston Ruter wrote:

<blockquote

 cite="mid:i2sfb8299e11004051044g6691edfepc871c8b9b587c991@mail.gmail.com"

 type="cite">DM:<br>

  <br>

  <blockquote

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"

 class="gmail_quote">But what we really need is not a parser but a

tokenizer. I'm thinking about writing one (my degree work was in

compiler writing). Basically, we repeat the same tokenization code in

several places. It should be trivial to write a complete, accurate one.<br>

  </blockquote>

  <br>

I've also been wanting to work on a tokenizer. At Open Scriptures, the

text of a work is currently represented by two <a

 moz-do-not-send="true"

 href="http://github.com/openscriptures/api/blob/master/models.py">models</a>

(database tables): <a moz-do-not-send="true"

 href="http://github.com/openscriptures/api/blob/master/models.py#L242">Token</a>

and <a moz-do-not-send="true"

 href="http://github.com/openscriptures/api/blob/master/models.py#L315">Structure</a>.

Tokens are the smallest divisible units of text, such as words,

punctuation, and whitespace; and structures are the spans of tokens

that form logical units, such as verses, paragraphs, quotes, etc. The

structures are standoff-markup for the tokens. With the underlying data

stored in this way, it can then be serialized in whichever hierarchy

desired (book-section-paragraph, book-chapter-verse, all-milestoned,

etc) or whichever data format is needed (OSIS, SWORD Module, XHTML,

etc.)<br>

</blockquote>
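<br>
To check that I am reading the model correctly, here is a minimal
sketch of tokens with standoff structures. The class names, fields and
sample text are my own assumptions for illustration, not the actual
Open Scriptures models.py:<br>
<pre>
```python
from dataclasses import dataclass


@dataclass
class Token:
    position: int  # order of the token within the work
    text: str      # a word, a punctuation mark, or a whitespace run


@dataclass
class Structure:
    kind: str   # hypothetical label, e.g. "verse" or "phrase"
    start: int  # position of the first token in the span
    end: int    # position of the last token in the span (inclusive)


def render(tokens, structure):
    """Join the text of the tokens covered by one structure."""
    return "".join(t.text for t in tokens[structure.start:structure.end + 1])


words = ["In", " ", "the", " ", "beginning", "."]
tokens = [Token(i, w) for i, w in enumerate(words)]

# Two structures over the same tokens; neither one is stored inside the text.
verse = Structure("verse", 0, 5)
phrase = Structure("phrase", 2, 4)

verse_text = render(tokens, verse)
phrase_text = render(tokens, phrase)
```
</pre>
Because the structures only point at token positions, overlapping
hierarchies can coexist without touching the token stream.<br>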

<br>

This is a much lower level of tokenization than an XML tokenizer needs. This would be a

tokenizer for the text between tags. Having a single tokenizer that

does both would be more efficient when both are wanted and slower when

only XML tokens are needed.<br>

<br>

I think a model could be constructed that could do both and allow one

to ask for the depth of tokenization that is needed.<br>
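<br>
A rough sketch of what I mean, with a depth argument that decides
whether text runs are split further. The function name, the depth
values and the regular expressions are all assumptions of mine, and the
text-level split here is deliberately naive:<br>
<pre>
```python
import re

# \x3c and \x3e are the left and right angle brackets, written as escapes
# so that this listing survives being pasted into an HTML mail.
MARKUP = re.compile(r"(\x3c[^\x3e]*\x3e)")
# Naive text-level split: words, whitespace runs, or single other characters.
TEXT = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(source, depth="markup"):
    """Yield (kind, value) pairs.

    depth="markup": tags and whole text runs only.
    depth="text":   text runs are split into words, spaces and punctuation.
    """
    for piece in MARKUP.split(source):
        if not piece:
            continue  # split() leaves empty strings around adjacent tags
        if MARKUP.fullmatch(piece):
            yield ("tag", piece)
        elif depth == "text":
            for match in TEXT.finditer(piece):
                yield ("text", match.group())
        else:
            yield ("text", piece)


sample = "\x3cverse\x3eIn the beginning.\x3c/verse\x3e"
coarse = list(tokenize(sample))
fine = list(tokenize(sample, depth="text"))
```
</pre>
A caller that only wants markup tokens never pays for the text-level
pass, since that branch runs only when depth is "text".<br>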

<br>

There is a big complication with the parsing of text: it is language

dependent. For example, Thai has words but not word breaks. Basically,

the task will require a Unicode-aware and somewhat language-aware word-break

algorithm. The best I've seen is in ICU.<br>
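<br>
A small illustration of the problem, not of the ICU solution. The Thai
phrase is my own sample (three words written without spaces), and the
helper name is hypothetical:<br>
<pre>
```python
# Three Thai words written, as usual, without spaces between them.
# The code points are given as escapes because this mail is ISO-8859-1.
thai = "\u0e09\u0e31\u0e19\u0e23\u0e31\u0e01\u0e04\u0e38\u0e13"
english = "I love you"


def whitespace_words(text):
    """Split on whitespace only, as much ad hoc tokenizer code does."""
    return text.split()


english_count = len(whitespace_words(english))  # one token per word
thai_count = len(whitespace_words(thai))        # whole phrase is one token
```
</pre>
Splitting on whitespace finds every English word but returns the Thai
phrase as a single token, which is why a dictionary-based or otherwise
language-aware break algorithm is needed.<br>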

<br>

Lucene has a wonderful example in their Jira issues database of how to

do tokenization. (1488, if I remember.)<br>

<br>

<br>

<blockquote

 cite="mid:i2sfb8299e11004051044g6691edfepc871c8b9b587c991@mail.gmail.com"

 type="cite"><br>

So what I'm currently ruminating on is the process of importing the raw

data into the Token and Structure models. I wrote an <a

 moz-do-not-send="true"

 href="http://github.com/openscriptures/api/blob/master/importers/Tischendorf-2.5.py">importer</a>

for the Tischendorf GNT data which does everything, both tokenizing and

parsing, but obviously there is going to be a lot of code in common

with other importers that are written. So I too am thinking about how

these importers can be reduced to the bare minimum to handle the unique

aspects of the raw data (i.e. normalize it), and then stream the tokens

back to a central importer that parses the input and stores it into the

Token and Structure models. This central importer facility could be a

web service.<br>

  <br>

I'd love to collaborate with you on this. We could come up with a

common tokenizer that can be used by both SWORD and Open Scriptures.

The importer web service could take tokens as input and as output

generate a SWORD module and also populate the Open Scriptures models at

the same time.<br>

  <br>

Thoughts?<br>

</blockquote>

<br>

Sounds good to me, too.<br>

<br>

In Him,<br>

&nbsp;&nbsp;&nbsp; DM<br>

<br>

<blockquote

 cite="mid:i2sfb8299e11004051044g6691edfepc871c8b9b587c991@mail.gmail.com"

 type="cite"><br>

Weston<br>

  <br>

  <br>

  <br>

  <div class="gmail_quote">On Mon, Apr 5, 2010 at 10:24 AM, Daniel

Owens <span dir="ltr">&lt;<a moz-do-not-send="true"

 href="mailto:dhowens@pmbx.net">dhowens@pmbx.net</a>&gt;</span> wrote:<br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Yes,

I agree, and if there were a feedback mechanism for the module creator

to let them know how to start fixing an OSIS file or conf file, it

would save Chris (or whoever else approves modules) time on the basic

stuff.<br>

    <font color="#888888"><br>

Daniel</font>

    <div class="im"><br>

    <br>

On 4/5/2010 11:09 AM, DM Smith wrote:<br>

    </div>

    <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

      <div class="im">This is a great idea. Rather than emailing source

to modules at crosswire dot org, one could upload it via a web service.

We could have stages of validation (xmllint) and construction

(osis2mod). Such a service could evaluate the quality of the submission.<br>

      <br>

In Him,<br>

&nbsp; &nbsp;DM<br>

      <br>

On 04/05/2010 12:01 PM, Weston Ruter wrote:<br>

      </div>

      <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

        <div class="im">Why not turn osis2mod into a web service? Then

it wouldn't matter how it is implemented since it would be abstracted

away by the web service interface. It could use the best XML libraries

available today and written in the programming language of choice, both

of which would make maintenance and the addition of new features much

easier.<br>

        <br>

Weston<br>

        </div>

      </blockquote>

    </blockquote>

  </blockquote>

  </div>

  <br>

  <br>

  <br>

  <div class="gmail_quote">On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <span

 dir="ltr">&lt;<a moz-do-not-send="true"

 href="mailto:dmsmith@crosswire.org">dmsmith@crosswire.org</a>&gt;</span>

wrote:<br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div class="im">On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:<br>

    <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

On 5 April 2010 13:55, Manfred Bergmann&lt;<a moz-do-not-send="true"

 href="mailto:manfred.bergmann@me.com" target="_blank">manfred.bergmann@me.com</a>&gt;

&nbsp;wrote:<br>

&nbsp; <br>

      <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hi DM.<br>

        <br>

Am 05.04.2010 um 13:21 schrieb DM Smith:<br>

        <br>

&nbsp; &nbsp; <br>

        <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Regarding using a "real" parser, it is a good idea. But we don't want

SWORD to be dependent on an external parser.<br>

&nbsp; &nbsp; &nbsp; <br>

        </blockquote>

What's the reason for that?<br>

I could understand if it would mean for the user to install certain

libraries manually but when the sources can be integrated into the

project and have the appropriate licence, then why not?<br>

        <br>

        <br>

Manfred<br>

        <br>

&nbsp; &nbsp; <br>

      </blockquote>

IMHO there is no harm in bringing in libxml or a much more lightweight<br>

parser like GMarkup. The build system just needs to be adjusted to<br>

link e.g. libxml for the osis2mod binary and not the shared sword library.<br>

It could even be made a new tool, osisxml2mod for example, and<br>

built optionally, so that you can still have a full sword dev<br>

environment without libxml.<br>

      <br>

Tools for creating modules do not have to be linked with sword or even<br>

live in the sword tarball / svn, although that does help consistent<br>

distribution of tools.<br>

&nbsp; <br>

    </blockquote>

    </div>

I don't remember all of Troy's reasoning when I argued for a true

parser.<br>

    <br>

From what I recall:<br>

o To maintain freedom to re-license SWORD (e.g. for some other Bible

society) we need to be able to keep third-party library dependencies

well managed. The license needs to be compatible with the GPL but

cannot be GPL.<br>

    <br>

o The parser that we have is minimal and simple, sacrificing accuracy

and completeness for speed. Regarding accuracy, e.g. the parser allows

for spaces around = in attribute declarations. Regarding completeness,

e.g. it does not handle namespaces, cdata, dtds/schemas, ....

Significantly, it does not require a well-formed document, allowing for

fragments. Rather than raising an error, it continues where a conforming XML parser is

required to stop.<br>

    <br>

o This parser has better error reporting in that it is based upon

knowledge of the input. E.g. it reports the verse having the problem.<br>

    <br>

o By SWORD having the parser, we are not dependent on finding an

implementation for every platform (e.g. Windows).<br>

    <br>

There may be other reasons. I'm willing to live with it.<br>

    <br>

But what we really need is not a parser but a tokenizer. I'm thinking

about writing one (my degree work was in compiler writing). Basically,

we repeat the same tokenization code in several places. It should be

trivial to write a complete, accurate one.<br>

    <br>

In His Service,<br>

    <font color="#888888"> &nbsp; &nbsp;DM</font>

    <div>

    <div class="h5"><br>

    <br>

_______________________________________________<br>

sword-devel mailing list: <a moz-do-not-send="true"

 href="mailto:sword-devel@crosswire.org" target="_blank">sword-devel@crosswire.org</a><br>

    <a moz-do-not-send="true"

 href="http://www.crosswire.org/mailman/listinfo/sword-devel"

 target="_blank">http://www.crosswire.org/mailman/listinfo/sword-devel</a><br>

Instructions to unsubscribe/change your settings at above page<br>

    </div>

    </div>

  </blockquote>

  </div>

  <br>

</blockquote>

<br>

</body>

</html>