Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Component/s: Text Problem
- Labels: None
- Environment: N/A
Description
Unicode replacement character U+FFFD is present in two verses of Matthew 9 (verses 27 and 32):
Matthew 9:27 Ciise intuu meeshaas ka tegayay, waxaa daba socday laba nin oo Indhala�, iyagoo qaylinaya oo leh, Ina Daa'uudow, noo naxariiso.
Matthew 9:28 Goortuu guriga galay ayay nimankii indhaha la'aa u yimaadeen. Markaasaa Ciise wuxuu ku yidhi, Ma rumaysan tihiin inaan waxan yeeli karo? Waxay ku yidhaahdeen, Haah, Sayidow.
Matthew 9:29 Markaasuu indhahooda taabtay isagoo leh, Sida rumaysadkiinnu yahay ha idiin noqoto.
Matthew 9:30 Markaasaa indhahoodii furmeen. Ciise aad buu u amray, oo ku yidhi, Iska jira; ninna yaanu ogaan.
Matthew 9:31 Laakiin iyagu way baxeen oo warkiisa ku fidiyeen dhulkaas oo dhan.
Matthew 9:32 Kolkay baxeen, bal eeg, waxaa loo keenay nin Carrabla� oo jinni qaba.
We had confirmation from SIM in October 2009 that these should be replaced by the proper curly quote, U+2019:
"Sorry to have delayed in giving you an answer regarding the replacement characters in Matthew 9. I have been out of the country and could not look things up. The corrections that you made are correct. In the printed KQA book the ' curly quotation occurs correctly in the verses 27 and 32, as well as in the .pdf files that we sent to the printers. We don't know how those two slanted quotation marks got into the text sent to you, but I am glad you found them for your formatting and have made the necessary changes."
NB. His reply was sent in the context of the KQA Go Bible.
- Search results for multiline s2 titles.txt (13/Dec/12 12:47 PM, 1 kB, David Haslam)
- somali.osis.xml.character.frequency.txt (12/Dec/12 5:39 AM, 3 kB, David Haslam)
Activity
Chris,
I'll discuss this with Peter and see what we can do.
SIM supplied the source text as USFM / Paratext, and it was Peter who converted the USFM to OSIS.
I already fed back to SIM about this minor glitch, and they really should have fixed it.
The printed version for the KQA does not suffer the problem.
We still have contact with the folk at SIM.
David
It's actually much worse an issue than I first observed in Dec 2009...
Most of the apostrophes in the SFM source text have been mysteriously replaced by the right single quotation mark. Trying to understand what might be the root cause.
I do have a complete set of SFM files, with the one for Matthew corrected by me.
Need to determine whether these errors occurred during the conversion from USFM to OSIS, and if so, why?
Sent an email to Peter last night, with a copy to Chris.
The USFM source text contained only 2 single right quotation marks (in Matthew 9:27,32), both of which should really be an ordinary apostrophe.
In the Somali alphabet, the apostrophe is the letter used for the sound "hamsa".
This character represents a glottal stop in Somali, i.e. it is equivalent to the Arabic sound 'alef'.
File 41_MAT_SBB.SFM (as received) was ANSI encoded. The other 65 SFM files are encoded UTF-8 (without BOM).
It would appear that the occurrence of the one ANSI encoded file has led to a problem during the conversion of the 66 files from USFM to OSIS.
All the mid-word apostrophes were mysteriously replaced by a single right quotation mark, whereas all the end-of-word apostrophes were left unchanged.
Anyone casually examining only the first few SFM files (GEN through DEU say) would easily have deduced that the whole translation was probably encoded as UTF-8. The fact that just one wasn't could not have been readily foreseen.
Whatever else it does, usfm2osis.pl outputs a single OSIS XML file for the whole Bible. Perhaps the exceptional encoding of 41_MAT_SBB.SFM caused the whole output file to be treated differently than UTF-8, with some sort of hidden rule being applied to the parsing of what were apostrophes.
These conjectural conclusions need to be checked by examining the OSIS XML file used to make the module.
I don't have a copy, though Peter may still have it.
There was once a line in usfm2osis.pl that changed apostrophes to curly quotes (U+0027 -> U+2019) iff they appeared between two word characters (letters, numbers, etc.):
$line =~ s/(\w)\'(\w)/"$1" . chr(0x2019) . "$2"/eg;
It's commented out now, but still visible in the source. SomKQA would have been converted from USFM to OSIS at a time when this was still active.
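To illustrate the effect of that rule, here is a minimal Python sketch of the same substitution. It is a re-rendering for demonstration only, not the code in usfm2osis.pl; the function name is hypothetical, and Python's `\w` is Unicode-aware, which makes no difference for the plain Latin letters used here.

```python
import re

RIGHT_SINGLE_QUOTE = "\u2019"

def curl_midword_apostrophes(line: str) -> str:
    """Mimic the old rule: an apostrophe between two word characters
    becomes U+2019; apostrophes elsewhere are left alone."""
    return re.sub(
        r"(\w)'(\w)",
        lambda m: m.group(1) + RIGHT_SINGLE_QUOTE + m.group(2),
        line,
    )

# Mid-word apostrophe (the Somali glottal stop) is changed ...
print(curl_midword_apostrophes("Ina Daa'uudow"))
# ... but an end-of-word apostrophe is not.
print(curl_midword_apostrophes("Indhala', iyagoo"))
```

This matches the observed pattern: mid-word apostrophes were replaced, while end-of-word apostrophes were left unchanged.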
I don't think that Matthew being ASCII would have any effect on anything, since ASCII-encoded text is also UTF-8-encoded text. (I think you probably meant ASCII rather than ANSI. If you meant Latin-1/CP1252, there could be some errors, but I don't think Somali uses any symbols outside of the ASCII range.)
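The encoding point can be checked with a short Python sketch. It assumes (this is not confirmed anywhere in the thread) that the two odd characters in the Matthew file were CP1252 right single quotation marks, i.e. the single byte 0x92. The helper name is hypothetical.

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")

# Pure ASCII bytes are valid UTF-8, so nothing changes.
ascii_bytes = "Ina Daa'uudow".encode("ascii")
print(decode_as_utf8(ascii_bytes) == "Ina Daa'uudow")

# A CP1252 curly quote is the lone byte 0x92, which is not valid UTF-8
# on its own and so decodes to the replacement character U+FFFD.
cp1252_bytes = "Indhala\u2019,".encode("cp1252")
print(repr(decode_as_utf8(cp1252_bytes)))
```

If that assumption holds, it would account for exactly two U+FFFD characters appearing while every ASCII apostrophe in the same file survived decoding intact.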
I just discovered that I do still have a copy of the OSIS XML file which Peter prepared and sent me on 2009-06-18 or shortly afterwards.
This file contains:
7823 single right quotation marks
222 apostrophes
It is therefore clear that the process of "cleaning up the SFM files" or converting the SFM to OSIS is when the issue was first caused.
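A count like the one above can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and the attached character-frequency report may have been produced with a different tool.

```python
from collections import Counter

# Characters of interest in this issue.
WATCHED = {
    "\u0027": "APOSTROPHE",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\ufffd": "REPLACEMENT CHARACTER",
}

def count_watched(text: str) -> dict:
    """Return how often each watched character occurs in the text."""
    counts = Counter(text)
    return {name: counts[ch] for ch, name in WATCHED.items()}

sample = "Ina Daa\u2019uudow, Indhala\ufffd, la'aa"
print(count_watched(sample))
```

To run it on a real file, read the OSIS XML with `open(path, encoding="utf-8").read()` and pass the result to `count_watched`.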
Chris's input finally clinches it.
It was the line in usfm2osis.pl that changed apostrophes to curly quotes (U+0027 -> U+2019) iff they appeared between two word characters (letters, numbers, etc.)!
This may have made some kind of sense for a few English[-like] translations, but it was too severe for general use.
I have just updated my copy of the OSIS XML file, replacing all U+2019 and both U+FFFD with U+0027.
I'll send a copy to Chris, so that he can rebuild the module, including updating the conf file.
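The replacement itself is a simple character mapping. A minimal Python sketch follows; the function name is hypothetical, and a blanket replacement like this is only safe under the assumption stated in this issue, namely that every U+2019 and U+FFFD in this particular text stands for an original apostrophe.

```python
# Map both unwanted characters back to the plain apostrophe U+0027.
TO_APOSTROPHE = {0x2019: "'", 0xFFFD: "'"}

def restore_apostrophes(text: str) -> str:
    """Replace U+2019 and U+FFFD with U+0027 throughout the text."""
    return text.translate(TO_APOSTROPHE)

print(restore_apostrophes("Ina Daa\u2019uudow, Indhala\ufffd"))
```

For a translation that uses genuine curly quotation marks, a blanket mapping like this would be wrong and the affected characters would need to be reviewed individually.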
NB. When Bruce replied to me in October 2009, I'm certain he was going just by visual appearance rather than Unicode NCR, when he advised regarding the two replacement characters in Matthew 9.
Just wondering ....
"It's commented out now, but still visible in the source. SomKQA would have been converted from USFM to OSIS at a time when this was still active"
.... whether there were any other modules adversely affected by that line having been active for a period of time.
Please hold ...
Though I've already sent an updated OSIS XML file to Chris & Peter, I've just also made a further cosmetic change.
History_1.1=Corrected the OSIS source text, apostrophes in place of single quotation marks. Cosmetic improvement: Added blank paragraph before section titles. (2012-12-12)
I've rebuilt the module, and am currently testing it with Xiphos.
So far, the updated SomKQA module version 1.1 has been tested with Xiphos, Bible Desktop and xulsword.
Also in And Bible on an Android 4 tablet (local manual install).
Some front-ends have minor presentation quirks relating to the placement of verse tags in relation to chapter titles. Looking at these is not within the scope of this issue.
Updated module and source text has been sent to Chris, with a copy to Peter.
Just to confirm that I checked the validity of the updated OSIS XML file against the DTD. It is still valid.
Please hold!
While browsing the document with And Bible, I just came across two spurious USFM tags in Psalm 119, both \s2.
These were in the received SFM file, as lines with the tag but no related title text.
They should simply be removed.
Search 19_PSA_SBB.SFM for regexp "^\\s2$", and you get 2 hits.
Line 7: \s2
Line 6213: \s2
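The same search can be scripted. Below is a minimal Python sketch; the function name is hypothetical, and unlike the strict `^\\s2$` pattern it also tolerates trailing whitespace after the tag, which is an assumption about what should count as an empty title.

```python
import re

# A line consisting only of the \s2 tag, optionally followed by whitespace.
EMPTY_S2 = re.compile(r"^\\s2\s*$")

def find_empty_s2(lines):
    """Return the 1-based line numbers of \\s2 tags that have no title text."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if EMPTY_S2.match(line.rstrip("\r\n"))
    ]

sample = ["\\c 119", "\\s2", "\\s2 Kanu waa sabuur", "\\q"]
print(find_empty_s2(sample))
```

Passing the lines of 19_PSA_SBB.SFM (read as UTF-8) to `find_empty_s2` would list the candidate lines for removal.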
I have also found that many of the </title> end tags for Psalm titles are in the wrong place!
This is because the USFM file had multiple line titles for some \s2 tags. Example:
\c 4
\s2 Kanu waa sabuur Daa'uud u tiriyey madaxdii
muusikaystayaasha, oo waxaa lagu qaadaa alaab xadhko leh oo
muusiko ah.
\q
This was badly converted by usfm2osis.pl as follows:
<chapter sID="Ps.4" osisID="Ps.4"/>
<title>Kanu waa sabuur Daa'uud u tiriyey madaxdii</title>
muusikaystayaasha, oo waxaa lagu qaadaa alaab xadhko leh oo
muusiko ah.
...
It should have been as follows:
<chapter sID="Ps.4" osisID="Ps.4"/>
<title>Kanu waa sabuur Daa'uud u tiriyey madaxdii
muusikaystayaasha, oo waxaa lagu qaadaa alaab xadhko leh oo
muusiko ah.</title>
...
Fixing this new problem requires either a lot of manual editing, or the use of an improved conversion script.
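One automated option would be to pre-process the SFM so that each \s2 title sits on a single line before conversion. The following Python sketch shows the idea. It is not part of either conversion script, the function name is hypothetical, and it assumes that any non-blank line directly after a \s2 line that does not start with a backslash is a continuation of the title.

```python
def join_multiline_s2(lines):
    """Join continuation lines of multi-line \\s2 titles onto the tag line."""
    out = []
    in_title = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("\\s2 "):
            # Start of a section title that carries text.
            out.append(stripped)
            in_title = True
        elif in_title and stripped and not stripped.startswith("\\"):
            # Continuation of the current title: append to the tag line.
            out[-1] += " " + stripped
        else:
            # Any other marker or a blank line ends the title.
            out.append(line.rstrip("\r\n"))
            in_title = False
    return out

psalm4 = [
    "\\c 4",
    "\\s2 Kanu waa sabuur Daa'uud u tiriyey madaxdii",
    "muusikaystayaasha, oo waxaa lagu qaadaa alaab xadhko leh oo",
    "muusiko ah.",
    "\\q",
]
for line in join_multiline_s2(psalm4):
    print(line)
```

With the title on one line, a converter that wraps only the \s2 line in a title element would close the element in the right place.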
Chris's new Python script usfm2osis.py is intended to be a better method than our old Perl script, so this would be a good opportunity to test it before I spend a lot of time doing manually something that would be much better automated.
There are only 21 such multi-line \s2 titles, and all of them are in Psalms.
Therefore an ad hoc correction is feasible without too much manual effort.
Regexp search for \s2 tags in merged.sfm.txt
All the 21 instances are within 19_PSA_SBB.SFM
Fixed these 21 titles by manually editing file somali.updated.osis.xml
Rebuilt module as version 1.2 with updates to the conf file.
Tested in Xiphos - these titles now look OK.
Neither of my installed JSword apps, Bible Desktop and And Bible, displays any s2 titles in the Psalms.
Neither does xulsword, which is a branch from SWORD.
Something strange going on here. I need help.
The titles in other books are displayed as expected, as far as I can determine.
Even Xiphos has a problem with some Psalm titles!
It does not display the major titles for the five "books" of Psalms.
I think the book of Psalms in SomKQA may be badly structured as regards the use of the <section> element.
The XML needs to be reviewed for this book. The output of usfm2osis.pl was flawed.
I'm not sure any front end will handle the 5 Psalm book titles.
These are titles that should stand before the chapter that starts the new section. They probably should be put by osis2mod in verse 0 for the chapter, the same as a chapter title.
Dumping the book with mod2imp should give good debug info.
The fourth 'major section' in Psalms should be for Psalms 90-106, but there is a premature </div> at the end of Psalm 97 followed by the start of a new section in Psalm 98.
Also, each stanza of Psalm 119 is a section of its own.
So there are 28 sections in the Psalms in total (5 + 1 + 22).
These sections are all displayed at the same "folding level" in XML Copy Editor.
It may have been better to have used <div type="majorSection"> for the five books of Psalms.
This attribute value was not used.
Thanks DM for your input. I have not tried the nightly build for BD.
And remember, I'm using Windows 7 (64-bit), so I'd probably need a special CMD file merely to open it, were I to download it.
Maybe I should let you have an interim copy of the module?
And Bible 1.6.0 does display "chapter titles" (in the other 65 books).
In XML however, these are actually "section titles" that just happen to be before verse 1. Example:
<chapter sID="Matt.10" osisID="Matt.10"/>
<div type="section"><p></p>
<title>Ciise Laba Iyo Tobankiisii Rasuul</title>
<title type="parallel">(Mar. 3:13-19; Luuk. 6:12-16)</title>
<p>
<verse sID="Matt.10.1" osisID="Matt.10.1"/>
Markaasuu wuxuu u yeedhay laba-iyo-tobankii xertiisa ahaa oo wuxuu siiyey amar ay jinniyo wasakh leh ku saaraan oo ay bugto walba iyo cudur walba ku bogsiiyaan.
</p>
The minor quirk is that the verse 1 tags have a line feed before the verse text, iff the chapter has a title.
I have just sent the latest interim build (version 1.2) and OSIS XML file to DM Smith by email.
Yesterday, I sent the same files to Chris, with a cc: to Peter (fio).
Chris,
Please can you fix this issue. The SIM email I cited above should be sufficient authority.
Thanks.
David