[sword-devel] [osis-editors] Re: The death of OSIS?
Steven J. DeRose
sderose at acm.org
Thu Aug 12 09:45:02 MST 2004
At 22:09 +1000 2004-08-11, Kahunapule Michael P. Johnson wrote:
>The problem I have with OSIS (at least the version of documentation
>that I have) is that it does not encode enough information to
>reliably reconstitute quotation mark punctuation for the range of
>languages and Bible translations that I work with. It doesn't even
>cover English properly. The reason is that you state in the
>documentation that quotations should be marked with <q
>who="Nameofspeaker" sID="someuniquething">....<q who="Nameofspeaker"
>eID="someuniquething"> and NOT with the quotation marks. This is OK
>for SOME situations; to wit: standard English texts using the same
>quotation punctuation rules as the NIV, and Bible texts in languages
>that happen to use the same characters and rules for quotation
>marks. This is NOT OK for other situations; to wit: English texts
>using different quotation mark styles (like the NASB) or no
>quotation marks at all (like the KJV). It occurs to me that by just
>ignoring <q> and <speech> altogether, I could put in the normal
>quotation punctuation for the given language as Unicode characters
>in the right places and be happy-- except for two things.
It may well be that we all made mistakes in the design of quotation
handling in OSIS, but I assure you we considered a much wider range
of cases than the English NIV or English. Some of us are of US
origin, but even so I don't think we have any monolinguals among us.
There is a real tradeoff here -- are quotation marks conventional
ways of marking a discourse phenomenon (let's call it "quotation" to
keep things simple), or are they part of "the text"? That is not so
straightforward as it seems to me you are suggesting. There were no
quotation marks in the original texts of the Bible, so all the
quotation marks are products of someone's interpretation.
Nevertheless, we all agree that OSIS markup has to provide enough
information to get the formatted result that one wants.
Actually, let me clarify that a little: widow and orphan management
is an important part of high-quality formatting: certainly part of
"the formatted result that one wants." But surely it shouldn't be
part of what OSIS encodes. This may seem obvious or trivial, but I
have heard people criticize OSIS for just this: they look at a
printed Bible someone produced from OSIS source using some formatting
tool that doesn't do widowing well, and say "OSIS can't produce a
good Bible" -- we must always keep in mind that there are at least
two separate parts involved here: the markup and the engine that
>One is that I want to encode some (but not all) of the Bible texts
>for "red letter" editions. Actually, I don't really mean to specify
>that the words of Jesus have to be in red. I just want to mark the
>direct quotes of Jesus in a way that makes it easy for those who
>wish to present the Bible text to display the direct quotes of Jesus
>in red (or some other distinctive way) if they want to. I don't even
>care if people display Jesus' direct quotes in red or not, but I do
>care that if they do, the markers are in the right places so that
>the correct words are marked. I can use <q who="Jesus"
>eID="book.chapter.verse.0"> for that, but then if I do that for the
>KJV, will the application reading the OSIS file add quotation marks?
>If I use OSIS for a language that uses different quotation marks,
>what will happen? What about open quote reminders at new paragraphs
>and stanzas? Will they be inserted when they aren't supposed to be?
This is the key point, isn't it? "will the application reading the
OSIS file add quotation marks?" is not a question that can be
answered. Which application? Reasonable software for formatting XML
should do what your style sheets say it should do. Perhaps not all
software is reasonable, but even most CSS implementations give you
that much control.
Clearly the KJV and the NIV have different styles for quotations. The
style sheets you would use to generate printed versions of them
therefore would differ. They might be completely separate, or just
differ in a few things, or a very clever stylesheet might even check
what version it's formatting (by looking at the header) and do the
appropriate thing for any version it knows about, and a default thing
By not enshrining punctuation in the text itself, a wider range of
options is available to the translators, publishers, and other
concerned parties. For example, if I were printing an NIV in France
for some reason, I might want to use the French chevron-like
quotation marks (sorry, I forget the name for them just now). No
problem: tweak the stylesheet. You don't even have to touch the
text itself -- thus the risk of accidentally messing it up is
reduced. This is especially important for minority languages, where
the typesetter probably doesn't know the language, and so cannot
easily detect if they messed things up.
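To make the "tweak the stylesheet" point concrete, here is a minimal sketch in Python rather than in a real stylesheet language. Everything in it is an assumption for illustration: the STYLES table, the style names, the "<q>"/"</q>" token convention, and the render_quotes function are hypothetical, not part of OSIS or of any existing tool. It only shows that the same marked-up text can yield different punctuation without the text being edited.

```python
# Hypothetical style table: one (open, close) pair per nesting level.
# The style names and character choices are illustrative assumptions.
STYLES = {
    "us-english": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "french": [("\u00ab\u00a0", "\u00a0\u00bb"), ("\u201c", "\u201d")],
    "none": [("", "")],  # e.g. an edition printed without quotation marks
}


def render_quotes(tokens, style):
    """Turn a token stream into text, generating quote marks from markup.

    "<q>" and "</q>" tokens stand in for quotation start/end markup;
    every other token is literal text.
    """
    levels = STYLES[style]
    out = []
    depth = 0
    for tok in tokens:
        if tok == "<q>":
            # Pick the opening mark for the current nesting depth.
            out.append(levels[depth % len(levels)][0])
            depth += 1
        elif tok == "</q>":
            depth -= 1
            out.append(levels[depth % len(levels)][1])
        else:
            out.append(tok)
    return "".join(out)


# One marked-up source, with a nested quotation and no literal quote marks.
TOKENS = ["He said, ", "<q>", "Go and tell them, ", "<q>", "Come", "</q>", "</q>"]
```

Calling render_quotes(TOKENS, "us-english"), render_quotes(TOKENS, "french"), and render_quotes(TOKENS, "none") gives three differently punctuated results from the identical source; only the table entry changes.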
Also, these source files will be processed by many things other than
formatters. Consider blind users with voice-generation interfaces:
they won't get quotation marks at all -- but if the system knows
there is a quote starting, it should be able to signal that to them.
One system might just say "quote" in whatever the user's language is;
a better system might generate voice inflections or suprasegmentals
of some sort to communicate the same thing. Second, consider a search
engine: it shouldn't have to search for a different pattern of
specific characters to locate quotes in every language it encounters
(especially when some patterns are ambiguous).
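As a rough illustration of the search case, the following Python sketch collects the text of every quotation attributed to a given speaker by looking only at the markup, never at quote characters. The sample fragment is a simplified, hypothetical OSIS-like snippet (no namespace, invented IDs), and quotes_by is an illustrative name, not an existing API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified OSIS-like fragment using milestone-style <q>
# elements with sID/eID. Note there are no quote characters in the text.
SAMPLE = (
    '<p>Jesus answered, <q who="Jesus" sID="q1"/>Follow me.'
    '<q who="Jesus" eID="q1"/> Then Peter said, '
    '<q who="Peter" sID="q2"/>Lord, we will.<q who="Peter" eID="q2"/></p>'
)


def quotes_by(speaker, xml_text):
    """Return the text of each quotation whose 'who' attribute is speaker."""
    root = ET.fromstring(xml_text)
    results = []
    open_ids = {}  # sID -> list of text pieces collected so far

    def add_text(text):
        # Append text to every quotation by this speaker that is still open.
        if text:
            for pieces in open_ids.values():
                pieces.append(text)

    def walk(elem):
        if elem.tag == "q" and elem.get("who") == speaker:
            if elem.get("sID"):
                open_ids[elem.get("sID")] = []
            elif elem.get("eID") in open_ids:
                results.append("".join(open_ids.pop(elem.get("eID"))))
        add_text(elem.text)
        for child in elem:
            walk(child)
            add_text(child.tail)  # text following a milestone is its tail

    walk(root)
    return results
```

The same function works unchanged whatever punctuation a given edition prints, which is the point: quotes_by("Jesus", SAMPLE) depends on the who attribute alone. A "red letter" display could use the same information to decide which spans to color.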
So, it seems to me we definitely need to have markup in there for
quotes -- the question then is whether OSIS quote markup provides
sufficient information to drive a formatter, and if not, what to do
>The other problem with controlling quotation punctuation with OSIS
>and always using markup (i. e. q or speech elements) is that there
>are not just start and end locations. There are also open quote
>reminder locations. This gets confusing. Can I specify that a
>quotation starts at a given location with one character, continues
>at a paragraph boundary with a different character, then ends with
>still another character? Would it be OK to use a duplicated sID in a
>q milestone element to indicate that this is a part of the same
>quotation, but more punctuation is needed here?
Absolutely agreed. We discussed this at length (Patrick, can we add a
section with some examples for this in the doc, if we haven't yet?).
Typically, the placement of quotation reminders is determined by some
fairly simple rule, that may differ by language, writing system,
culture, and genre (and probably other factors too). Your example of
a paragraph boundary is a very common case. In such a case, the
stylesheet rule for paragraph simply checks whether a quotation is
open, and if so, issues the appropriate punctuation.
This is a valuable approach, because there might well be two
different groups that share a translation, but live in different
areas and have become accustomed to different quotation style rules.
For example, a language group from a war-torn country where many have
emigrated, and ended up in different countries. If you put the
literal quote characters in the text for one group, you have to go
and fix it all manually for the other group. If instead you mark the
quotes via markup and have a stylesheet generate the correct
characters for display, then you just change that stylesheet, getting
a uniform change with much less effort.
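The paragraph rule described above (check whether a quotation is open, and if so issue the reminder) can be sketched as follows. This is Python standing in for a stylesheet, under assumed conventions: the event names ("text", "q-start", "q-end", "para") and the function render_with_reminders are hypothetical, and real reminder rules may depend on more than paragraph boundaries.

```python
def render_with_reminders(events, open_mark="\u201c", close_mark="\u201d",
                          reminder="\u201c"):
    """Render (kind, value) events, adding a reminder mark at each
    paragraph break that falls inside an open quotation."""
    out = []
    depth = 0  # number of quotations currently open
    for kind, value in events:
        if kind == "q-start":
            depth += 1
            out.append(open_mark)
        elif kind == "q-end":
            depth -= 1
            out.append(close_mark)
        elif kind == "para":
            out.append("\n\n")
            if depth > 0:
                # A quotation is still open: issue the reminder mark.
                out.append(reminder)
        else:  # "text"
            out.append(value)
    return "".join(out)


# A quotation that spans a paragraph boundary; no literal quote marks.
EVENTS = [
    ("text", "He said, "),
    ("q-start", None),
    ("text", "First paragraph."),
    ("para", None),
    ("text", "Second paragraph."),
    ("q-end", None),
    ("para", None),
    ("text", "They left."),
]
```

A community that does not use reminders would call render_with_reminders(EVENTS, reminder=""), and one that uses a different reminder character would pass that character instead; the source events stay the same in each case.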
Does any of us know of a situation where the placement of "reminder"
punctuation is discretionary? That is, where we have to record it
because there is no rule, or a rule so complex, that the marks cannot
reasonably be generated by a stylesheet? (I'm not including making a
facsimile edition of a copy text including errors).
In my opinion (and that of my OSIS validation code), it would be
incorrect to use a duplicate sID for this case as the OSIS schema
stands right now. It could be that there is need to explicitly mark
paragraph boundaries inside quotes, rather than letting the style
sheet do the right thing. If you believe so, can you explain it to me
in more detail? I'm not quite understanding your point here, and I
very much want to.
*If* there turns out to be such need, then I see a few simple solutions:
a) Allow additional milestones with the same sID (or possibly eID,
but I like your sID notion better)
b) Create a new empty element for the purpose, say <q-continued> or similar
c) Reserve a 'type' attribute value somewhere to distinguish this case.
If there really is need, you can simulate solution b or c right now
in OSIS by using a regular milestone and assigning it a special type
for this purpose. People (namely, the people writing stylesheets for
you or doing typesetting) might complain unless you could show why it
is in fact needed -- but if it really is, then it is.
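For the explicit-marker fallback, a sketch of how a processor might honor such a milestone is below. The type value "x-quote-continued" is purely an assumption made up for this example (it is not defined by OSIS), the sample fragment is simplified and namespace-free, and render_explicit_reminders is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical type value marking "quotation continues here"; an
# assumption for this sketch, not something the OSIS schema defines.
CONTINUE_TYPE = "x-quote-continued"

# Simplified OSIS-like fragment: the reminder location is recorded
# explicitly with a typed milestone instead of being derived by rule.
SAMPLE = (
    '<div><p><q sID="q1"/>First part.</p>'
    '<p><milestone type="x-quote-continued"/>Second part.<q eID="q1"/></p></div>'
)


def render_explicit_reminders(xml_text, open_mark="\u201c",
                              close_mark="\u201d", reminder="\u201c"):
    """Render text, emitting a reminder only where a milestone asks for it."""
    out = []

    def walk(elem):
        if elem.tag == "q":
            # sID marks the start of a quotation, eID the end.
            out.append(open_mark if elem.get("sID") else close_mark)
        elif elem.tag == "milestone" and elem.get("type") == CONTINUE_TYPE:
            out.append(reminder)
        if elem.text:
            out.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                out.append(child.tail)
        if elem.tag == "p":
            out.append("\n")  # one line per paragraph, for simplicity

    walk(ET.fromstring(xml_text))
    return "".join(out)
```

Even in this variant the character itself still comes from the renderer's parameters, so only the location is recorded in the source; whether that extra markup is justified is exactly the open question.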
>In short, I consider the placement of quotation punctuation and the
>selection of characters to be used for quotation punctuation to be a
>part of the Bible translation text itself, and if any encoding, like
>OSIS, cannot guarantee that these characters are maintained in their
>original locations, then that encoding is defective.
Wow. That's interesting. Let me see if I understand it right: So if I
published an NIV in France (or better, a Francophone country with an
English-speaking minority population that wants the NIV), and if I
used chevrons for quotation marks, you would say it's a different
*translation*, not just a different printing or edition or layout? I
must admit I have a hard time accepting that.
As for guaranteeing, no encoding can guarantee the result of applying
software to it. For all the encoding knows, the formatter you're
using simply throws out all punctuation marks, or even all the text.
It seems to me that that doesn't make all encodings defective. There
must be some more limited claim you're trying to get at here, but I
don't see clearly what it is. Help, please?
It seems to me that the *fact* of something being a quotation is
clearly part of the translation text, but that the punctuation marks
(or whatever) used to communicate that are part of the formatting,
just like the choice of font. I still consider them very important,
just as I consider the font choice important (printing a Bible in
Comic Sans, or in 5 pt type, would probably be a very bad thing to
do); but to me it wouldn't be changing "the text".
Can you explain this further for me if it's central to your point?
But it seems to me this is not central -- you just want the quotes
right, right? And that doesn't require anywhere near so strong a
>Do you see the problem?
I don't think so. Please explain further.
>Now, let me suggest at least two possible solutions that are easy to
>incorporate into the OSIS standard. First, let me explicitly state
>what I'm trying to accomplish:
>1. Preserve the current OPTION in OSIS to generate quotation
>punctuation with markup.
>2. Preserve the OPTION in OSIS to mark quotations by speaker for
>specialized searches or, in the case of Jesus' direct quotes, to
>color or present them in some different way.
>3. Add the OPTION to control quotation punctuation precisely for
>languages and styles that differ from the "usual" in the type and
>placement locations of quotation punctuation.
>Suggested solution number 1 (recommended):
>Document that any <q> or <speech> element marked with an attribute
>of n=" " (a blank space) should not be taken as an instruction to
>insert any quotation mark. Rather, in this case, it should be
>assumed that the correct punctuation is already in the text as a
>Unicode character (just like other kinds of punctuation). <q> or
><speech> elements not so marked would be taken as an instruction to
>insert quotation punctuation in the manner that the NIV English
>Bible does, including open quote reminders, and alternating double
>and single typographic quotes for nested quotes.
I rather like the idea I perceive here -- some signal that the
punctuation is already in the text. The stylesheet could use this in
a nicely general way. I don't think it belongs on the 'n' attribute,
but that's a minor detail.
Is there a case, though, where a stylesheet couldn't be reasonably
expected to generate all the right quotation marks? If a language
required a different quotation mark depending on the voicing of the
following consonant, or (worse) the gender of the next noun, that
would be beyond typical stylesheet mechanisms to do. I don't know of
any languages where punctuation choice depends on linguistic
phenomena that aren't already represented by other markup or layout
(like paragraph breaks). If there are, then we have a clear problem
to deal with. But given the historical development of writing
systems, that seems to me really unlikely. Anybody know an exception?
>Suggested solution number 2:
>If for some bizarre reason you are opposed to letting quotation
>punctuation exist as a normal Unicode character in the text, you
>could (1) allow the exact character to be used to be specified with
>its hexadecimal code position in the n attribute of the p or speech
>element, and (2) define two other elements to specify if open quote
>reminders are appropriate at new paragraphs and stanzas, and (3)
>specify what the open quote reminder character should be.
Parts 2 and 3 of this would go in a stylesheet, not in the text; you
can do that now. If the character(s) were to go in an attribute, they
could just go there -- no need to code in hex. But I don't think
there's anything preventing such characters in the text in OSIS now
-- so long as you do still mark the quotes (which is surely necessary
for most non-printing processing). I'd have to read the fine details
of the wording to be certain.
>Suggested solution number 3:
>Make something up-- anything that solves the problem above, and ask
>me if I think it would work or not.
>By the way, I would be happy to help you proofread and review the
>next release of OSIS documentation and schema.
Many thanks! Feedback from people who have actual concrete issues to
deal with is *very* valuable.
>>Hope you are having a great day!
>I am. It is about my bed time, now...
Steve DeRose -- http://www.derose.net
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose at acm.org or steve at derose.net