[sword-devel] Character Frequency
Peter von Kaehne
refdoc at gmx.net
Thu Jul 7 14:38:21 MST 2011
On 03/07/11 18:43, Greg Hellings wrote:
What one really needs, though (and all solutions mentioned so far lack
this), is a character counter which disregards OSIS tags and attributes.
A Latin "c" in the text of a Cyrillic Bible can either be perfectly innocent (as
part of e.g. the "chapter" tag) or it might be in place of a "с"
(\u0441), in which case it causes a mess.
The same goes for numbers - a common problem in Arabic script texts we
receive is that the references in xrefs are in Western numbers. Again,
such numbers are a normal part of OSIS attributes, so they cannot simply
be replaced throughout the file.
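To illustrate the idea of counting only what is in text nodes, here is a minimal sketch in Python (not the Perl of the scripts mentioned in this mail). The function name text_char_counts and the sample markup are made up for the example, and it assumes the input is well-formed XML that fits in memory.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import Counter


def text_char_counts(xml_string):
    """Count characters in text nodes only; tag names and attributes are ignored."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    # itertext() yields the text and tail of every element, never markup.
    for chunk in root.itertext():
        counts.update(chunk)
    return counts


# Hypothetical sample: the second word starts with a Latin "c" (U+0063)
# where a Cyrillic "с" (U+0441) was intended.
sample = (
    '<chapter osisID="Gen.1"><verse osisID="Gen.1.1">'
    "\u0441\u043b\u043e\u0432\u043e c\u043b\u043e\u0432\u043e"
    "</verse></chapter>"
)

for ch, n in sorted(text_char_counts(sample).items()):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}: {n}")
```

With such a map the stray Latin "c" shows up as its own entry, while the "c" and "h" of the "chapter" tag and the digits of the osisID attributes are never counted.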
I have just now committed a couple of scripts to sword-tools to assist:
1) charmap.pl takes an OSIS file (or rather any XML file) and returns a
character map similar to those discussed, but solely for text nodes
2) osis_tr.pl does a "tr" job - replacing one set of characters with
another, but again only in text nodes
3) numbers.pl fixes the numbers problem above. I wrote this first,
before I generalised it into the osis_tr.pl script, but think it has
value, as the problem is so common.
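The "tr, but only in text nodes" approach can be sketched roughly as follows. This is Python, not the committed Perl, and it makes no claim to mirror how osis_tr.pl or numbers.pl work internally. The names translate_text_nodes and WESTERN_TO_ARABIC_INDIC are hypothetical, and the mapping to Arabic-Indic digits (U+0660 to U+0669) is an assumption about which digits the text should end up with.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: Western digits 0-9 to Arabic-Indic digits U+0660..U+0669.
WESTERN_TO_ARABIC_INDIC = str.maketrans(
    "0123456789", "".join(chr(0x0660 + i) for i in range(10))
)


def translate_text_nodes(xml_string, table):
    """Apply a character translation to text nodes, leaving attributes alone."""
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        # Text directly inside the element, before its first child.
        if elem.text:
            elem.text = elem.text.translate(table)
        # Text following the element's closing tag, inside its parent.
        if elem.tail:
            elem.tail = elem.tail.translate(table)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: the osisRef attribute must keep its Western digits.
sample = '<note type="crossReference"><reference osisRef="Gen.1.1">1:1</reference></note>'
print(translate_text_nodes(sample, WESTERN_TO_ARABIC_INDIC))
```

Any other pair of character sets can be passed in via str.maketrans. One caveat of this sketch: ElementTree drops comments by default and rewrites namespace prefixes on output, so a real OSIS file would need more care than this.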