[sword-devel] Character Frequency

Peter von Kaehne refdoc at gmx.net
Thu Jul 7 14:38:21 MST 2011


On 03/07/11 18:43, Greg Hellings wrote:

> http://dl.thehellings.com/count.py

What one really needs, though (and all solutions mentioned so far lack
it), is a character counter which disregards OSIS tags and attributes.
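
To illustrate the idea, here is a minimal Python sketch of such a counter. It is not one of the scripts discussed in this thread; the function name text_node_char_counts and the sample markup are made up for the example, and it assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_node_char_counts(xml_string):
    """Count characters in text nodes only (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    # itertext() yields the text and tail strings in document order;
    # element names and attribute values are never included.
    for chunk in root.itertext():
        counts.update(chunk)
    return counts


# Made-up sample: the tag and attribute contain "c" and "G",
# but only the verse text "abc a" should be counted.
sample = '<chapter osisID="Gen.1"><verse osisID="Gen.1.1">abc a</verse></chapter>'
counts = text_node_char_counts(sample)
```

Here the "c" of the "chapter" tag and the "G" of the osisID attribute do not show up in the counts, only the characters of the verse text do.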

A Latin "c" in the text of a Cyrillic Bible can either be perfectly
innocent (as part of e.g. the "chapter" tag) or it might be in place of
a Cyrillic "с" (\u0441), in which case it causes a mess.
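
As a rough sketch of how such stray characters could be found, the following Python snippet lists Latin letters that occur in text nodes only. The helper name latin_letters_in_text and the sample are hypothetical, and the test on the Unicode character name is just one simple way to recognise Latin letters.

```python
import unicodedata
import xml.etree.ElementTree as ET


def latin_letters_in_text(xml_string):
    """Return the Latin letters found in text nodes (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    hits = []
    for chunk in root.itertext():  # text nodes only, no tags or attributes
        for ch in chunk:
            # Unicode names of Latin letters start with "LATIN",
            # e.g. "LATIN SMALL LETTER C".
            if ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN"):
                hits.append(ch)
    return hits


# Made-up sample: the first word is fully Cyrillic, the second word
# starts with a Latin "c" in place of Cyrillic U+0441.
sample = (
    '<chapter osisID="Gen.1"><verse>'
    "\u0441\u043b\u043e\u0432\u043e c\u043b\u043e\u0432\u043e"
    "</verse></chapter>"
)
hits = latin_letters_in_text(sample)
```

The Latin letters in the "chapter" and "verse" tags and in the osisID value are ignored; only the Latin "c" inside the verse text is reported.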

Similarly for numbers - a common problem in Arabic script texts we
receive is that the references in xrefs are in Western numbers. Again,
such numbers are a normal part of OSIS attributes.

I have just now committed a couple of scripts to sword-tools to assist
with this:

1) charmap.pl takes an OSIS file (or rather any XML file) and returns a
character map similar to those discussed, but solely for text nodes.

2) osis_tr.pl does a "tr" job - replacing one set of characters with
another, but again only in text nodes

3) numbers.pl fixes the numbers problem above. I wrote this first,
before I generalised it into the osis_tr.pl script, but think it has
value, as the problem is so common.
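
For readers who want to see the "tr on text nodes only" idea in code, here is a minimal Python sketch. It is not the Perl implementation from sword-tools; the function name tr_text_nodes and the sample are invented for the example, and the choice of Arabic-Indic digits (U+0660 to U+0669) as the target set is an assumption.

```python
import xml.etree.ElementTree as ET


def tr_text_nodes(xml_string, src, dst):
    """Replace characters of src with those of dst, in text nodes only
    (hypothetical helper; src and dst must have the same length)."""
    table = str.maketrans(src, dst)
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        # Only text and tail are touched; tag names and attribute
        # values are left exactly as they were.
        if elem.text:
            elem.text = elem.text.translate(table)
        if elem.tail:
            elem.tail = elem.tail.translate(table)
    return ET.tostring(root, encoding="unicode")


WESTERN = "0123456789"
# Assumption: the target digits are Arabic-Indic, U+0660..U+0669.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))

# Made-up sample: digits occur both in attributes and in the text.
sample = '<verse osisID="Gen.1.1"><note n="1">12:3</note> 45</verse>'
result = tr_text_nodes(sample, WESTERN, ARABIC_INDIC)
```

The digits in osisID="Gen.1.1" and n="1" stay Western, while "12:3" and "45" in the text are converted. Real OSIS files declare a namespace, which ElementTree writes back with a generated prefix unless it is registered with ET.register_namespace first.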

Peter


