[sword-devel] Coming soon: new improved sword searching
Sun, 8 Sep 2002 21:31:44 -0700 (MST)
On Sun, 8 Sep 2002, Joel Mawhorter wrote:
> Wouldn't it make more sense to use UTF-16 than UTF-8 in regular expressions?
> At least with UTF-16, in most cases, 1 character == 1 symbol so regular
> expressions would be more manageable (e.g. what does a dot mean in a regular
> expression when being matched against symbols that can be represented in 1, 2
> or 3 chars?). Does ICU have regular expression support? I know the regular
> expression support in Java 1.4 is very nice and uses UTF-16 but alas we can't
> really use that in Sword unless we come up with a CNNI (C non-native
> interface :-).
Nope. Sword is entirely UTF-8 internally. Perl just happens to be the
same. Perl has a nice regex implementation built on UTF-8. In Perl, a
dot means a character. Regexes should operate on characters, not bytes,
after all. No, ICU doesn't have any regex support. It's almost entirely
devoted to i18n/l10n stuff, though it does have a simple I/O library.
Using code that works with UTF-8 also benefits us by not requiring that we
convert to/from UTF-16.
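
To make the "a dot means a character" point concrete, here is a minimal
sketch in Python (not Perl, and not Sword code). It only illustrates the
semantics: the same pattern is run once against decoded text and once
against the raw UTF-8 bytes. The sample word and the function names are
assumptions chosen for the example, and Python's str is not stored as
UTF-8 internally, so this says nothing about how Perl implements it.

```python
import re

# Sample word (Greek "logos"), written with escapes so the code points are explicit.
WORD = "\u03bb\u03cc\u03b3\u03bf\u03c2"   # 5 characters
WORD_UTF8 = WORD.encode("utf-8")          # 10 bytes, 2 per character here


def dot_matches_chars(text: str) -> list:
    """Match '.' against decoded text: each match is one whole character."""
    return re.findall(r".", text)


def dot_matches_bytes(data: bytes) -> list:
    """Match '.' against raw UTF-8 bytes: each match is a single byte."""
    return re.findall(rb".", data)


if __name__ == "__main__":
    print(len(dot_matches_chars(WORD)))       # one match per character
    print(len(dot_matches_bytes(WORD_UTF8)))  # one match per byte
```

A byte-oriented engine splits every one of these letters in two, which is
the ambiguity Joel was asking about. A character-aware engine avoids it
without the text ever leaving UTF-8.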