[sword-devel] Comming soon: new improved sword searching

Mon, 9 Sep 2002 14:30:45 -0400

On Monday 09 September 2002 12:27 pm, Matthew Donadio wrote:
> Lamar Owen wrote:
> > Of course, if the existing one is DFA, and the replacement is NFA,
> > then some regexes may break in subtle ways.... See the O'Reilly regex
> > book for details on how deterministic finite automatons and
> > non-deterministic finite automatons differ from the point of view of
> > crafting regexes.

> Can you elaborate on this?  I know the theory behind this, and am a bit
> confused by the statment.  It is pretty straightforward to convert a DFA
> into an RE.  The textbook method for converting an RE into a DFA is RE
> -> NFA+e moves -> NFA -> DFA.  You then can perform minimization of the
> DFA to get the optimum one.

But that's on the engine side.  I'm referring to the difference between 
crafting the regex itself. Things such as 'Is alternation greedy' are 
dependent upon whether the regex engine is DFA, traditional NFA, or POSIX 
NFA.

Plus, the exact form of the regex matters more with an NFA engine, thanks to 
its regex-directed nature -- you have 'wiggle' room in crafting your regexes.  
To the text-driven DFA, the exact from of the regex isn't nearly as 
important.

The presence of backreferences pretty well tags the engine as being NFA. The 
Perl engine is decidedly NFA.  Some might even say it is irregular....

When searching Bible phrases, the difference between the NFA's  matching 
behavior and the DFA's 'longest leftmost' behavior might cause confusion in 
search results.  That is, the same regex fed to an NFA might find a different 
set of matches than the same regex fed to a DFA engine.  Of course, the POSIX 
NFA matches longest leftmost, too, further confusing the issue.

For simple searching, where backreferences aren't important, a DFA might be a 
better way to go.  AFAIK, a DFA engine can be found in original awk, 
Kernighan's nawk, and a few versions of GNU awk, as well as Aho's original 
egrep -- oddly enough, original grep was an NFA.  But GNU grep is mostly DFA.

So, is the existing GNU regex library DFA or NFA?  (I don't know -- haven't 
looked at the code).  If DFA, we would be subbing the Perl engine, which, due 
to the feature set Perl needs, is NFA.  

Summary: For the DFA, regex 'crafting' isn't so important -- but for an NFA 
proper regex crafting can mean substantial speed differences.

For more information, read 'Mastering Regular Expressions' by Jeffrey E F 
Friedl from O'Reilly.

And I know my usage of these terms (as well as Mr. Friedl's) is somewhat 
irregular, since the NFA and DFA concepts are used for much more than regular 
expressions.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11