[sword-devel] Entities in modules

DM Smith dmsmith at crosswire.org
Wed Nov 11 13:50:12 MST 2009


We have a few modules that have entities in them. These take the form &nbsp; (a character entity), &#85; (a numeric decimal entity) and &#xC5; (a numeric hex entity).

These cause various problems:
a) If a module is encoded in Latin-1, there may be entities that do not fall within that encoding. In an HTML viewer, which does substitutions, the resulting text may mix Latin-1 and UTF-8, causing display problems.

b) Entities cause search problems. For example, if one is searching for Bokmål and the text is encoded as Bokm&aring;l, it won't be found. When indexing with CLucene, the word will be broken into three words: Bokm, aring and l. Searching for aring will then find it as a word.

c) Transliteration won't work on words with entities.

d) Removing decorations (umlauts, rings, accents, ...) from words won't work.

e) It is legal to have numeric entities for &, <, >, " and ', but SWORD does not recognize these.

And so forth.
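
Problem (b) is easy to reproduce. The sketch below is not CLucene's analyzer; it assumes a simple stand-in tokenizer (the hypothetical tokenize function) that splits on anything other than letters and digits, which is enough to show the effect:

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ":utf8");

# Hypothetical stand-in for an indexer's tokenizer: split on every run of
# characters that are neither letters nor digits.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^\p{L}\p{N}]+/, $text;
}

# The word as stored with a character entity, and the same word with the
# entity replaced by the real character (U+00E5, a with ring above).
my @with_entity = tokenize("Bokm&aring;l");
my @with_char   = tokenize("Bokm\x{E5}l");

print join(" | ", @with_entity), "\n";   # three tokens: Bokm | aring | l
print join(" | ", @with_char), "\n";     # one token
```

With the entity in place, a search for the whole word has nothing to match, while the fragment aring is indexed as if it were a word.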

When we create a module, we should make sure to replace entities with their UTF-8 equivalents (after first making sure that the text is UTF-8).

To that end, I have written a Perl utility, EntityReplacer, that will normalize the entities for <, >, &, " and ', and replace most other entities (about 2700) with their UTF-8 equivalents.
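
The real table and logic live in EntityReplacer.pm. Purely to illustrate the idea, here is a minimal sketch that does not use the module: the three-entry name table and the replace_entity function are hypothetical, and it assumes that "normalize" means mapping every form of the five special characters to a single named entity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ":utf8");

# Hypothetical, tiny name table for illustration only.
my %named = (nbsp => 0xA0, aring => 0xE5, Aring => 0xC5);

# The five characters that must stay escaped, keyed by code point.
my %special = (38 => '&amp;', 60 => '&lt;', 62 => '&gt;',
               34 => '&quot;', 39 => '&apos;');
my %special_name = map { $_ => 1 } qw(amp lt gt quot apos);

sub replace_entity {
    my ($entity) = @_;
    my $code;
    if ($entity =~ /^&#[xX]([0-9a-fA-F]+);$/) {
        $code = hex($1);                      # numeric hex entity
    } elsif ($entity =~ /^&#([0-9]+);$/) {
        $code = $1 + 0;                       # numeric decimal entity
    } elsif ($entity =~ /^&([A-Za-z][A-Za-z0-9]*);$/) {
        return $entity if $special_name{$1};  # already in normal form
        $code = $named{$1};                   # character entity, if known
    }
    return $entity unless defined $code;      # unknown: leave untouched
    return $special{$code} if exists $special{$code};
    return chr($code);
}

my $line = "Bokm&aring;l &#85; &#xC5; &#38; &lt; &unknown;";
$line =~ s/(&#?[A-Za-z0-9]+;)/replace_entity($1)/ge;
print "$line\n";
```

Unknown entities are passed through unchanged in this sketch so that nothing is silently lost; how EntityReplacer itself treats them is not stated here.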

You can get the code here:
www.crosswire.org/~dmsmith/perl

As with Chris' Perl code, I have put it under the BSD license and copyrighted it to CrossWire.

It is packaged for CPAN, so you can install it in the usual way:
perl Makefile.PL
make
make test
make install

Or you can grab EntityReplacer.pm, put it in the same folder as your program, and call it in the following fashion:
#!/usr/bin/perl -w
use strict;

use FindBin qw($Bin);
use lib "$Bin";

use EntityReplacer;

binmode(STDOUT, ":utf8");

# Read the input, one line at a time, replacing on each line all entities, except ones for <, >, &, ' and ".
while (<>) {
        s/(\&#?[a-zA-Z0-9-]+;)/EntityReplacer::toReplacement($1)/geo;
        print STDOUT;
}

Hope you find it useful.

In His Service,
	DM

