On 11/9/07, Steve Sanbeg ssanbeg@ask.com wrote:
Wouldn't regexes always be compiled to FSMs, regardless of language or constructs?
Not if they include back-references, assertions, or other things not part of "real" regexes. A real regex can be defined as follows:
1) Any character is a regex.
2) Any regex followed by a * is a regex.
3) Any two regexes concatenated together are regexes.
4) Any two regexes concatenated together with a | character are regexes.
Stuff like [a-z] is just shorthand for (a|b|c|...|z), so that's still a proper regex, but things like back-references aren't possible to translate into those four rules, so you have to use something more sophisticated than a FSM.