[sword-devel] Searching for hyphenated words?

DM Smith dmsmith at crosswire.org
Sat Mar 2 09:42:02 MST 2013


I see two different questions being posed:
a) The correctness of using an ndash within a word.
b) The ability to search for words containing ndash or any kind of dash, including a simple hyphen.

I'll start with my conclusion: Changing the ndash to a simple hyphen does not really address the questions.

Regarding correctness:
The usage of ndash in the KJV is within names only. At the bottom, I've included a list of the names having an ndash. In the 2003 version of the 1769 KJV, these words were not hyphenated. They were hyphenated with an ndash in the 2006 cleanup. As an interesting aside, I looked at some of the non-name words that are hyphenated in the 1769 KJV and compared them to a photocopy of the 1611. These are words such as God-ward, us-ward, thee-ward, joint-heirs, .... My search was not exhaustive, but in the cases I checked the 1611 had no hyphens: it either concatenated the words, as with the -ward suffixes, or separated them with a space, as in joint heirs. The other thing I noticed was that in each case where the KJV (either 1769 or 1611) had a hyphenated name, it was a Hebrew transliteration of some sort and had an attached note to at least one of the instances.

One question is whether they should be taken as a whole or as parts. So, is Beth–el equivalent to Beth el or to Bethel? Another question: does a dash (hyphen, ndash, mdash, ...) have the same meaning today as it did hundreds of years ago? Same question but regarding different languages: do different languages use a dash with different semantics than modern English?

Regarding search:
This regards several issues:
How does Lucene handle these different characters?
What does an end user want/expect?
Can we leverage that to meet user expectation?

Lucene's handling:
Lucene uses an Analyzer to split text into words on punctuation for indexing and for search. JSword uses SimpleAnalyzer because it makes no further assumptions on the text. SWORD lib uses StandardAnalyzer which does. I think the StandardAnalyzer has special rules for hyphens. In Lucene 3.6 the StandardAnalyzer behavior changes to use UAX 29 rules for splitting the text. This is a huge step forward. I don't know whether it handles '-' differently than other punctuation. (JSword switched from the StandardAnalyzer to the SimpleAnalyzer very early on because of the extra assumptions that StandardAnalyzer makes about what the user wants to index and not index and because it was significantly slower.)

With the SimpleAnalyzer, a dash (hyphen, ndash, mdash) is used to create phrases: the text is split at the dash and the parts are matched as a phrase. As such, Beth–el, Beth-el and "Beth el" are equivalent. (This is with Lucene 3.0.3; earlier versions may differ.) Note, it really doesn't matter that it's a dash; any punctuation will do. I don't think this is the case with the StandardAnalyzer.
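
To make the equivalence concrete, here is a minimal Python sketch of the splitting rule described above. It is only an approximation of SimpleAnalyzer (split at anything that is not a letter, then lowercase), not Lucene code; the function name simple_tokens is made up for illustration.

```python
def simple_tokens(text):
    """Approximate SimpleAnalyzer: break at every non-letter, lowercase the rest.

    This is an illustrative stand-in, not Lucene's implementation.
    """
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            # Any non-letter (hyphen, ndash, mdash, space, ...) ends the token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


EN_DASH = "\u2013"
forms = ["Beth" + EN_DASH + "el", "Beth-el", "Beth el"]
for form in forms:
    print(form, "->", simple_tokens(form))
```

All three forms give the same two tokens, which is why they behave as the same phrase, while the unhyphenated Bethel stays a single, different token.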

One of the impacts of having hyphenated words is that searching for Bethlehem won't find Beth–lehem. (The NT and OT differ on the spelling in the KJV.) It doesn't matter what kind of dash is used. The user cannot omit the hyphen to concatenate the words.

Another impact of hyphenated words is that it is much harder to do a wild card search. It doesn't matter what kind of dash is used. If the search request has a dash a * cannot be used.

So Lucene can do the right thing wrt the ndash and hyphen. They are identical wrt indexing and searching. The user does not have to know the form that is used in the file and match that.

The other feature that Lucene offers out of the box is Fuzzy Searching. It will find close approximations to the word that you are requesting. All that needs to be done is append a ~ to the end of the word. For example, Abimelek~ finds Abimael, Abimelech, Abiezer and Ahimelech. This is not a Soundex search, so the results are often surprising. Bethelham~ finds Meshullam and Bethlehem~ finds betrothed but not Bethlehem.
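
Lucene's fuzzy matching is based on edit distance rather than on sound. The Python sketch below shows the idea with a plain Levenshtein distance and a fixed cutoff. The cutoff (max_distance), the function names and the tiny vocabulary are assumptions for illustration; Lucene's actual scoring and threshold differ, so this will not reproduce its result lists.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_matches(query, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first.

    max_distance is a hypothetical cutoff, not a Lucene parameter.
    """
    query = query.lower()
    scored = [(levenshtein(query, word.lower()), word) for word in vocabulary]
    return [word for distance, word in sorted(scored) if distance <= max_distance]


# Small made-up vocabulary standing in for the indexed terms.
vocabulary = ["Abimelech", "Ahimelech", "Bethlehem", "Meshullam"]
print(fuzzy_matches("Abimelek", vocabulary))
print(fuzzy_matches("Abimelek", vocabulary, max_distance=3))
```

Because only letter edits are counted, names that look alike on paper match even when they sound different, which is one reason the results can be surprising.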

Some front-ends don't use Lucene for indexing. Some use an older version. So the behavior can differ.
Also, SWORD doesn't require indexing for "slow" search. Don't know if the SWORD "slow" search treats the various dashes the same or differently. (I think this is the Multi-word search mentioned by David)

User expectation:
The hyphenation of these names is not common in other translations. I think that most users would expect Bethel and not Beth–el or Beth-el. Together this makes searching multiple Bibles at the same time very difficult.

I think it is reasonable to expect that a user will not know the proper spelling of more than a few of them, let alone that they are hyphenated.

Leveraging:
I think that if StandardAnalyzer does not give expected behavior then SimpleAnalyzer should be used.

I think that hyphenated words should also be indexed as unhyphenated.
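
As a rough illustration of indexing a hyphenated word both ways, the Python sketch below emits the parts of a dashed word and, in addition, the concatenated form. It is not JSword or Lucene code; the names index_terms and build_index, the sample strings and the "ot"/"nt" keys are all made up for the example.

```python
import re
from collections import defaultdict

DASH = r"[-\u2010-\u2014]"      # hyphen-minus plus U+2010..U+2014 (hyphen .. mdash)
LETTERS = r"[^\W\d_]+"          # one or more Unicode letters
WORD = re.compile(LETTERS + "(?:" + DASH + LETTERS + ")*")


def index_terms(text):
    """Yield the parts of each word and, for dashed words, the joined form too."""
    terms = []
    for match in WORD.finditer(text):
        parts = [part.lower() for part in re.split(DASH, match.group())]
        terms.extend(parts)
        if len(parts) > 1:
            terms.append("".join(parts))   # extra unhyphenated entry
    return terms


def build_index(verses):
    """Map each term to the set of references whose text contains it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        for term in index_terms(text):
            index[term].add(ref)
    return index


# Made-up sample strings, not quotations from the module.
verses = {
    "ot": "a man of Beth\u2013lehem",
    "nt": "born in Bethlehem",
}
index = build_index(verses)
print(sorted(index["bethlehem"]))
```

With the extra entry, a search for bethlehem reaches both spellings, while a search for the part beth still finds only the hyphenated one.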

Adding a simple filter to change the different forms of dashes into a single form for both search and index is a good solution, but it would break backward compatibility with existing indexes. Changing from StandardAnalyzer to SimpleAnalyzer would be as much of a pain, but it is a better solution (at least until 3.6, which I have not evaluated to see if it changes the behavior sufficiently).
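
For reference, the simple filter could look something like the following Python sketch. The set of characters treated as dashes and the name normalize_dashes are assumptions; a real implementation would be a filter in the analyzer chain, applied to both the indexed text and the query.

```python
# Characters folded to the plain hyphen-minus. This list is an assumption:
# hyphen, non-breaking hyphen, figure dash, ndash, mdash, minus sign.
DASH_TO_HYPHEN = {ord(ch): "-" for ch in "\u2010\u2011\u2012\u2013\u2014\u2212"}


def normalize_dashes(text):
    """Replace every dash variant with a plain hyphen-minus."""
    return text.translate(DASH_TO_HYPHEN)


# The same function must run on the text being indexed and on the user's query,
# otherwise the two sides can still disagree.
indexed = normalize_dashes("Maher\u2013shalal\u2013hash\u2013baz")
query = normalize_dashes("Maher-shalal-hash-baz")
print(indexed == query)
```

This only makes the dash forms agree with each other. It does nothing for a user who types the name with no dash at all.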

Conclusion: Changing the ndash to a simple hyphen does not really address the problems.

In Him,
	DM

Abed–nego
Abel–beth–maachah
Abel–maim
Abel–meholah
Abel–mizraim
Abel–shittim
Abi–albon
Abi–ezer
Abi–ezrite
Adoni–bezek
Adoni–zedek
Allon–bachuth
Almon–diblathaim
Ashdoth–pisgah
Ataroth–adar
Ataroth–addar
Aznoth–tabor
Baalath–beer
Baal–berith
Baal–gad
Baal–hamon
Baal–hanan
Baal–hazor
Baal–hermon
Baal–meon
Baal–peor
Baal–perazim
Baal–shalisha
Baal–tamar
Baal–zebub
Baal–zephon
Bamoth–baal
Bashan–havoth–jair
Bath–rabbim
Bath–sheba
Bath–shua
Beer–elim
Beer–lahai–roi
Beer–sheba
Beesh–terah
Ben–ammi
Bene–berak
Bene–jaakan
Ben–hadad
Ben–hail
Ben–hanan
Ben–oni
Ben–zoheth
Berodach–baladan
Beth–anath
Beth–anoth
Beth–arabah
Beth–aram
Beth–arbel
Beth–aven
Beth–azmaveth
Beth–baal–meon
Beth–barah
Beth–birei
Beth–car
Beth–dagon
Beth–diblathaim
Beth–el
Beth–emek
Beth–ezel
Beth–gader
Beth–gamul
Beth–haccerem
Beth–haran
Beth–hoglah
Beth–hogla
Beth–horon
Beth–jeshimoth
Beth–jesimoth
Beth–lebaoth
Beth–lehem–judah
Beth–lehem
Beth–maachah
Beth–marcaboth
Beth–meon
Beth–nimrah
Beth–palet
Beth–pazzez
Beth–peor
Beth–phelet
Beth–rapha
Beth–rehob
Beth–shan
Beth–shean
Beth–shemesh
Beth–shemite
Beth–shittah
Beth–tappuah
Beth–zur
Caleb–ephratah
Chephar–haammonai
Chisloth–tabor
Chor–ashan
Chushan–rishathaim
Col–hozeh
Dan–jaan
Dibon–gad
Ebed–melech
Eben–ezer
El–beth–el
El–elohe–Israel
Elon–beth–hanan
El–paran
En–eglaim
En–gannim
En–gedi
En–haddah
En–hakkore
En–hazor
En–mishpat
En–rimmon
En–rogel
En–shemesh
En–tappuah
Ephes–dammim
Esar–haddon
Esh–baal
Evil–merodach
Ezion–gaber
Ezion–geber
Gath–hepher
Gath–rimmon
Gibeah–haaraloth
Gittah–hepher
Gur–baal
Hamath–zobah
Hammoth–dor
Hamon–gog
Havoth–jair
Hazar–addar
Hazar–enan
Hazar–gaddah
Hazar–hatticon
Hazar–maveth
Hazar–shual
Hazar–susah
Hazar–susim
Hazazon–tamar
Hazezon–tamar
Helkath–hazzurim
Hephzi–bah
Hor–hagidgad
I–chabod
Ije–abarim
Ir–nahash
Ir–shemesh
Ishbi–benob
Ish–bosheth
Ish–tob
Ittah–kazin
Jaare–oregim
Jabesh–gilead
Jashubi–lehem
Jegar–sahadutha
Jehovah–jireh
Jehovah–nissi
Jehovah–shalom
Jiphthah–el
Jushab–hesed
Kadesh–barnea
Kedesh–naphtali
Keren–happuch
Kibroth–hattaavah
Kir–haraseth
Kir–hareseth
Kir–haresh
Kir–heres
Kirjath–arba
Kirjath–arim
Kirjath–baal
Kirjath–huzoth
Kirjath–jearim
Kirjath–sannah
Kirjath–sepher
Lahai–roi
Lo–ammi
Lo–debar
Lo–ruhamah
Maaleh–acrabbim
Magor–missabib
Mahaneh–dan
Maher–shalal–hash–baz
Malchi–shua
Me–jarkon
Melchi–shua
Meribah–Kadesh
Merib–baal
Merodach–baladan
Metheg–ammah
Migdal–el
Migdal–gad
Misrephoth–maim
Moresheth–gath
Nathan–melech
Nebuzar–adan
Nergal–sharezer
Obed–edom
Padan–aram
Pahath–moab
Pas–dammim
Perez–uzzah
Perez–uzza
Pharaoh–hophra
Pharaoh–nechoh
Pharaoh–necho
Pi–beseth
Pi–hahiroth
Poti–pherah
Rab–saris
Rab–shakeh
Ramathaim–zophim
Ramath–lehi
Ramath–mizpeh
Ramoth–gilead
Regem–melech
Remmon–methoar
Rimmon–parez
Romamti–ezer
Ru–hamah
Samgar–nebo
Sela–hammahlekoth
Shear–jashub
Shethar–boznai
Shihor–libnath
Shimron–meron
Succoth–benoth
Syria–damascus
Syria–maachah
Taanath–shiloh
Tahtim–hodshi
Tel–abib
Tel–haresha
Tel–harsa
Tel–melah
Tiglath–pileser
Tilgath–pilneser
Timnath–heres
Timnath–serah
Tob–adonijah
Tubal–cain
Uzzen–sherah
Zareth–shahar
Zaphnath–paaneah


On Mar 2, 2013, at 6:01 AM, Chris Burrell <chris at burrell.me.uk> wrote:

> Can't this be done with a simple filter, i.e. always change the '-' to one kind regardless of the length. And when the user input comes in, do the same.
> Chris
> 
> 
> On 2 March 2013 02:36, Nic Carter <niccarter at mac.com> wrote:
> 
> Do you have a proposed solution to this, David?
> 
> I know that on my iPhone it is very simple to use a proper ndash & so I will always use the correct type of dash according to what I am writing. (same with on a Mac!)
> However, the more significant issue is simply that people don't know there is a difference (or why they are different lengths, etc)...  ;)
> 
> On 25/02/2013, at 2:48 AM, David Haslam <dfhmch at googlemail.com> wrote:
> 
> > In the KJV module, if you want to search for [say] the hyphenated name
> > "Maher–shalal–hash–baz", you first have to be aware that this module uses
> > the ndash in place of the hyphen.
> >
> > btw.  It's not so easy to enter the ndash from a keyboard, and probably even
> > harder in an Android tablet or mobile.
> >
> > If you use ordinary hyphen/minus for the search key hyphen for this module,
> > you don't find anything with "Exact phrase".
> > If you use "Multi-word", you do find "Maher" highlighted in the found verse.
> > (e.g. using Xiphos).
> >
> > For modules in general, however, the user cannot usually know in advance
> > whether hyphenated words use the ndash, the hyphen or something else.
> >
> > Has anyone else looked into this aspect of the search feature?
> >
> > David
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://sword-dev.350566.n4.nabble.com/Searching-for-hyphenated-words-tp4652016.html
> > Sent from the SWORD Dev mailing list archive at Nabble.com.
> >
> > _______________________________________________
> > sword-devel mailing list: sword-devel at crosswire.org
> > http://www.crosswire.org/mailman/listinfo/sword-devel
> > Instructions to unsubscribe/change your settings at above page
> 
> 
