V11n was Re: [sword-devel] Jonah 1.17 / 2.1

Thu Mar 23 15:56:27 MST 2006

I realize that BibleTime, and obviously JSword, don't see much benefit 
in the SWORD engine as it stands now.

Basically, what you are both suggesting is removing the entire concept 
of a common engine, which JSword doesn't use now anyway, and BibleTime 
currently seems to have a 'work in spite of the limitations' mentality.

I obviously don't condone such a move.  It will remove the synergy 
between all of our efforts, leaving us nothing in common but OSIS 
documents and XSLT.  These two things are not bad, but currently 
we have so much more to offer than just these.

I would encourage teams to better work together and consider the 
contribution they might make to all projects by augmenting the engine, 
and thus embracing the cooperative environment we share here at CrossWire.

It is obviously your choice to proceed how you feel the Lord is 
directing you, but I would hope you embrace the benefits we've gleaned 
from 13 years of having a common framework and codebase.

	-Troy.

DM Smith wrote:
> 
> 
> Martin Gruner wrote:
>> DM,
>>
>> your proposal is excellent. Working with OSIS files directly is 
>> something Joachim and I have talked about already, seems to be a good 
>> way to go.
>>   
> 
> Joe and I have talked about "OSIS direct" too and we think that it is 
> the next big architectural addition for JSword. So what I outlined is 
> what we have been figuring out. It just happens that "OSIS direct" gives 
> us v11n for free.
> 
> I know that v11n is "next" for Sword, too. I just want to make sure that 
> JSword can handle whatever Sword decides for v11n, but if we can lead 
> that's great too.
> 
> So, I am planning to get started after I finish the KJV work.
> 
>> For the mapping, there would have to be some kind of object that is 
>> able to "translate" OsisIDs from one v11n scheme to another. This 
>> could probably be done by using an "absolute" (theoretical, 
>> nonexistent) v11n scheme and mapping all others to this one. With your 
>> system, this would not have to care about the order of the books. 
>> Might be done with a (c)lucene index too. I'll take this part if you 
>> do the module access. =)
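A minimal sketch of the "absolute scheme" idea, in Python for brevity. Everything here is an assumption made for illustration: the scheme names, the pivot IDs and the table entries are hypothetical, not real v11n data. Each scheme maps its own osisIDs to IDs in the abstract scheme, and translation goes through that pivot.

```python
# Hypothetical tables: osisID in a concrete scheme -> ID in the abstract
# ("absolute") scheme. The entries are made up for illustration only.
TO_ABSOLUTE = {
    "schemeA": {"Ps.3.1": "abs:title-of-ps3", "Ps.3.2": "abs:first-line-of-ps3"},
    "schemeB": {"Ps.3.1": "abs:first-line-of-ps3"},
}

# Inverted tables: abstract ID -> osisID in each concrete scheme.
FROM_ABSOLUTE = {
    scheme: {abs_id: osis_id for osis_id, abs_id in table.items()}
    for scheme, table in TO_ABSOLUTE.items()
}


def translate(osis_id, source, target):
    """Translate an osisID from one scheme to another via the pivot.

    Returns None when either scheme has no entry for the verse.
    """
    abs_id = TO_ABSOLUTE[source].get(osis_id)
    if abs_id is None:
        return None
    return FROM_ABSOLUTE[target].get(abs_id)


print(translate("Ps.3.2", "schemeA", "schemeB"))
```

With N schemes this needs N tables rather than one table per pair of schemes, and book order never enters into it.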
>>   
> I haven't written in C in ages. So maybe someone can port the Java code 
> after I write it.
> 
>> Are clucene and lucene (and lucene4c etc.) indexes identical, and 
>> portable?   
> 
> Troy and I did some experiments with this using clucene and lucene 
> 1.4.3. The indexes were not identical in a byte-for-byte comparison. 
> However, they were identical from a practical perspective. They worked 
> just fine for both, giving the same results for a set of queries. There 
> are also Python and Perl ports and perhaps others.
> 
> They are also portable to any OS.
> 
> I also tested compatibility with the upcoming lucene 2.0 (currently 
> called 1.9.1). Lucene 1.9.1 can read indexes created by 1.4.3 without any 
> problem, but 1.4.3 can't use indexes built by 1.9.1.
> 
> 
>> Could they be distributed and used by different frontends in parallel?
>>   
> 
> I'm not sure what you are asking. We can zip them up and move them to 
> different machines. So they can be distributed.
> Two applications, same or different, can use the same index at the same 
> time. When the index is being modified, it is locked for any other 
> threads or processes.
> 
>> You are aware of the fact that this would mean a complete paradigm 
>> shift for the Sword API?
>>   
> 
> I realize that it is very different. I also think it is a bit simpler, 
> too. From a client perspective, I don't think it is that big a shift. 
> Within JSword, we code to interfaces and I don't think the basic client 
> ones will have to change at all.
> 
> The big question is whether it is a proper direction and one that we are 
> all willing to embark on.
> 
> If so, we will have to have a new "module" type for the conf and it will 
> need to be version specific to lucene (e.g. MinimumLucene=1.4.3).
> 
> Once I get the KJV (nearly) done, I'll build an index for it as I have 
> described. (but not with the extra contexts at first). Then we can play 
> with the index to see if there are any gotchas that we did not anticipate.
> 
> I think I should be pretty near done with the KJV in a couple of weeks.
> 
>> mg
>>
>> Am Donnerstag, 23. März 2006 19:11 schrieb DM Smith:
>>  
>>> DavidTroidl at aol.com wrote:
>>>    
>>>> Hi,
>>>>
>>>> I also have several issues with osis2mod, and I was getting ready to
>>>> post.  The fact is that there are several versification schemes for
>>>> both Old and New Testaments.  I was having a similar problem with
>>>> re-versification in Tischendorf's Greek New Testament.  It has John
>>>> 1:52, because an earlier verse is sub-divided.  But it also has 3John
>>>> 15 and Rev 12:18, which agrees with UBS 4.
>>>>
>>>> How can we get osis2mod to recognize true variations in versification,
>>>> and not "standardize" everything?
>>>>       
>>> A SWORD module consists of text (possibly compressed) and an index into
>>> that text. (Compressed modules will have additional tables marking the
>>> start and end of the compression unit. But I am ignoring them in the
>>> discussion below.)
>>>
>>> In a nutshell, both the code that creates the index and the code that
>>> reads it need to be changed.
>>>
>>> Here is an overview of how it all hangs together. This may be a bit
>>> imprecise because the JSword implementation, which I work on and am
>>> familiar with, may be slightly different from the actual SWORD API
>>> implementation.
>>>
>>> The index is a big fixed size array with each entry giving the start and
>>> length of each verse. There are slots for "introductions" to chapters
>>> and books, e.g. Gen.0 would give the intro to Genesis and Gen.1.0 would
>>> give an introduction to Genesis Chapter 1.
>>>
>>> Lookup happens in this fashion: the verse reference is first normalized
>>> (e.g. Matthew 1:5 might become Matt.1.5), and then this is re-normalized
>>> into 40.1.5. Then that normalization is converted into an index into the
>>> fixed size array via a lookup table.
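The lookup steps above can be sketched as follows. This is a toy under stated assumptions, not the SWORD or JSword code: the tables cover only two chapters each of Genesis and Matthew, the slot layout (one slot per book introduction, one per chapter introduction, then one per verse) is my reading of the description, and the function names are hypothetical.

```python
import re

# Toy tables; a real engine would cover every book of the canon.
BOOK_ABBREV = {"genesis": "Gen", "matthew": "Matt"}
BOOK_NUMBER = {"Gen": 1, "Matt": 40}
VERSE_COUNTS = {"Gen": [31, 25], "Matt": [25, 23]}  # first two chapters only


def to_osis_id(reference):
    """Normalize e.g. 'Matthew 1:5' to 'Matt.1.5' (single-word book names only)."""
    match = re.fullmatch(r"\s*(\w+)\s+(\d+):(\d+)\s*", reference)
    if match is None:
        raise ValueError(f"cannot parse reference: {reference!r}")
    book, chapter, verse = match.groups()
    return f"{BOOK_ABBREV[book.lower()]}.{int(chapter)}.{int(verse)}"


def to_numeric(osis_id):
    """Re-normalize 'Matt.1.5' to the numeric triple (40, 1, 5)."""
    book, chapter, verse = osis_id.split(".")
    return (BOOK_NUMBER[book], int(chapter), int(verse))


def build_slot_table(verse_counts):
    """Assign consecutive array slots, including the introduction slots."""
    table = {}
    slot = 0
    for book, chapters in verse_counts.items():
        number = BOOK_NUMBER[book]
        table[(number, 0, 0)] = slot  # book introduction, e.g. Gen.0
        slot += 1
        for chapter, count in enumerate(chapters, start=1):
            for verse in range(0, count + 1):  # verse 0 = chapter introduction
                table[(number, chapter, verse)] = slot
                slot += 1
    return table


SLOTS = build_slot_table(VERSE_COUNTS)
print(SLOTS[to_numeric(to_osis_id("Matthew 1:5"))])
```

Because the table is fixed ahead of time, any reference that is not in it has nowhere to go, which is the root of the versification problem discussed here.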
>>>
>>> In the same fashion, the index is created. As the input is parsed, the
>>> verse body is substringed and titles which are immediately before the
>>> verse are marked as pre-verse and prepended to the verse. The verse
>>> reference is converted into the array index. The verse is written to the
>>> output file and the start of that verse in the output file is recorded
>>> in the index along with its length.
>>>
>>> You will note that the verses are laid down in the output file in the
>>> order that they are in the input file. If a verse exists more than once
>>> in the input, I think both get written to the output file, but the last
>>> one over-writes the first in the index. If a verse pertains to more than
>>> one KJV verse (e.g. <verse osisID="Gen.1.1 Gen.1.2"> text of Genesis 1.1
>>> and Genesis 1.2</verse>) then this is recorded in two index slots that
>>> point to the same place in the output file. It is possible to feed a
>>> correction to a module of just the changed verses. This will then be
>>> appended to the output file and the index will be updated to reflect the
>>> new material. The old material still remains.
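A rough sketch of the index-building behaviour just described, with a dict standing in for the fixed-size array (an assumption made to keep the example short). The function name and the sample verses are hypothetical.

```python
import io


def build_module(verses):
    """Append each verse to the data file and record (start, length) per osisID.

    `verses` is an iterable of (list_of_osis_ids, text) pairs in input order.
    """
    data = io.BytesIO()
    index = {}
    for osis_ids, text in verses:
        encoded = text.encode("utf-8")
        start = data.tell()
        data.write(encoded)
        for osis_id in osis_ids:
            # A later entry for the same osisID replaces the earlier one in
            # the index, but the earlier bytes stay in the data file.
            index[osis_id] = (start, len(encoded))
    return data.getvalue(), index


data, index = build_module([
    (["Gen.1.1"], "first"),
    (["Gen.1.2", "Gen.1.3"], "joined"),  # one text filed under two slots
    (["Gen.1.1"], "fixed"),              # a correction appended later
])
start, length = index["Gen.1.1"]
print(data[start:start + length])
```

The data file only ever grows; only the index decides which copy of a verse is visible.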
>>>
>>> When a verse reference is outside of the KJV v11n, it is recognized as a
>>> problem. Now there are only so many ways that the program can handle it.
>>> It could reject it. Or in the case of JSword, if the "book" and
>>> "chapter" are in the KJV v11n, then it figures out which verse is really
>>> meant by counting the verse number forward from the start of the
>>> chapter (Matthew 1 has only 25 verses). So Matt 1:27 would silently
>>> become Matt 2:2. Later when Matt 2:2 is seen, it would overwrite the
>>> earlier entry in the index and Matt 1:27 would be lost. There may be
>>> other strategies. But in every case it will not produce the desired
>>> results.
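The overflow can be shown with a little arithmetic. This sketch assumes a simplified layout with no introduction slots, so that the numbers match the Matt 1:27 example above; the helper names are hypothetical.

```python
KJV_MATT = [25, 23]  # verses in Matthew 1 and Matthew 2 in the KJV


def slot(chapter, verse):
    """Slot = verses in all earlier chapters + offset within this chapter."""
    return sum(KJV_MATT[:chapter - 1]) + verse - 1


def resolve(slot_number):
    """Map a slot back to the (chapter, verse) the fixed table assigns it."""
    remaining = slot_number
    for chapter, count in enumerate(KJV_MATT, start=1):
        if remaining < count:
            return (chapter, remaining + 1)
        remaining -= count
    raise IndexError("slot beyond the end of the table")


index = {}
index[slot(1, 27)] = "text of Matt 1:27"  # not in the KJV v11n
index[slot(2, 2)] = "text of Matt 2:2"    # lands in the same slot

print(resolve(slot(1, 27)), index[slot(1, 27)])
```

Both references compute to slot 26, so the second write replaces the first and the text of Matt 1:27 can no longer be reached.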
>>>
>>> Here is how I would suggest implementing a solution to this problem: use
>>> OSIS documents and use lucene with osisIDs as the keys.
>>>
>>> I have found that lucene is very fast. Input references would be
>>> normalized to osisIDs and these be used for lookup. Rather than storing
>>> the document in this index, the original would be left on disk as is
>>> (perhaps compressed by verse, chapter or book as we do today). The index
>>> would store start offset and end offset for each and every osisID in the
>>> document. The start offset would be to the beginning of the element and
>>> the end offset would be to the end of the element. In the case of
>>> milestoned elements, it would be from the start of the sID element to
>>> the end of the corresponding eID element. It could also handle multiple
>>> documents by storing the document names as well.
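A small sketch of the offset index, using a plain Python dict in place of Lucene so that it stays self-contained. The OSIS fragment is a made-up example, and the regular expressions assume a fixed attribute order; a real indexer would use a proper XML parser.

```python
import re

# Made-up OSIS fragment: one container verse and one milestoned verse.
DOC = (
    '<div type="book" osisID="Gen"><chapter osisID="Gen.1">'
    '<verse osisID="Gen.1.1">In the beginning</verse>'
    '<verse osisID="Gen.1.2" sID="Gen.1.2"/>And the earth<verse eID="Gen.1.2"/>'
    '</chapter></div>'
)


def index_offsets(doc):
    """Return {osisID: (start_offset, end_offset)} for the verse elements."""
    index = {}
    # Container form: from the start tag to the end of the closing tag.
    for m in re.finditer(r'<verse osisID="([^"]+)">.*?</verse>', doc, re.DOTALL):
        index[m.group(1)] = (m.start(), m.end())
    # Milestone form: from the sID element to the end of the matching eID element.
    for m in re.finditer(r'<verse osisID="([^"]+)" sID="([^"]+)"/>', doc):
        end_tag = f'<verse eID="{m.group(2)}"/>'
        end = doc.index(end_tag, m.end()) + len(end_tag)
        index[m.group(1)] = (m.start(), end)
    return index


INDEX = index_offsets(DOC)
start, end = INDEX["Gen.1.2"]
print(DOC[start:end])
```

The document itself is never rewritten; the index holds nothing but keys and offsets, so it can be rebuilt or extended at any time.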
>>>
>>> Handling a "passage", say Gen 50:1 - Ex 2, would become an osisRef of
>>> Gen.50.1-Exod.2. This in turn would indicate the start and end of the
>>> fragment in the document as the start offset of Gen.50.1 and the end
>>> offset of Exod.2.
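Resolving such an osisRef against the offset index might look like this. The offsets are invented numbers and the function name is hypothetical; the point is only that a range needs two lookups.

```python
# Hypothetical offsets, as an index like the one described above might hold.
OFFSETS = {
    "Gen.50.1": (1000, 1080),
    "Gen.50.26": (3000, 3090),
    "Exod.1": (3100, 5000),
    "Exod.2": (5000, 7200),
}


def resolve_ref(osis_ref, offsets):
    """Return (start, end) of the document fragment for an osisRef.

    A range 'A-B' runs from the start of A to the end of B; a single
    osisID is treated as the range 'A-A'.
    """
    first, _, last = osis_ref.partition("-")
    last = last or first
    return offsets[first][0], offsets[last][1]


print(resolve_ref("Gen.50.1-Exod.2", OFFSETS))
```

Nothing in the lookup depends on how many chapters or verses lie between the two endpoints, or on which book follows which in the file.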
>>>
>>> This solution allows:
>>>     for books of the bible to be in any order as required for a
>>> particular work.
>>>     for there to be any number of chapters in a book,
>>>     for there to be any number of verses in a chapter
>>>     for there to be prefaces, introductions, titles, colophons,
>>> appendices  and any other elements allowed by OSIS.
>>>     for the apocrypha to be before or after the NT or in a separate 
>>> file.
>>>     for each book or a set of books to be in separate files (in fact,
>>> one could go to the absurd level of doing it by paragraph).
>>>     for any other book (e.g. dictionary, Koran, ...) with a well defined
>>> hierarchical system of reference to be indexed or stored.
>>>     for the OSIS documents to be used for any other purpose by any other
>>> system that can handle OSIS docs (ignoring compression and encryption;)
>>>        (Maybe we don't want this last one;)
>>>
>>> I would also advocate storing two other contexts: one for a minimal
>>> well-formed xml fragment and one for a minimum display context (which
>>> would also be a well-formed xml fragment). The reason for these is that
>>> OSIS does not require that a verse, chapter or any other division be
>>> well formed. It only requires that the divs that are children of the
>>> osisText element be well formed.
>>>
>>> Well-formedness is a requirement for using xml processors (which JSword
>>> uses). So having a minimal xml context will solve that.
>>>
>>> The display context is needed to provide enough information to render
>>> the verse correctly. Two examples: First, in poetry (e.g. a Psalm), a
>>> verse may be wholly contained in a line of a "poem" and thus be well
>>> formed, but unless it is seen as part of the whole, it cannot be
>>> correctly rendered. Second, consider the words of Jesus (always a good
>>> idea:). It may be that a much earlier verse records that his speech
>>> begins and a much later verse records that it ends. Looking at the
>>> selected verse in isolation, it is impossible to know that it contains
>>> Jesus' words. So trying to apply red-letter text to his words would
>>> fail when looking at the verse alone.
>>> The trick would be deciding what constitutes a display context. It
>>> should at least encompass the larger of the paragraphs, quotes, speeches
>>> or line groups in which the verse appears/intersects, if any.
>>>
>>> The other advantage to using Lucene is that the indexes can be changed
>>> to add more information at a later time and existing processes would not
>>> need to be changed unless they were to take advantage of the additions.
>>> A given application, say BibleTime, could augment the index with further
>>> information (e.g. notes, internal processing info, ...) and BibleDesktop
>>> could use that index without needing to handle that additional info.
>>>
>>> Of course, the above does not solve the mapping of one v11n scheme to
>>> another.