On Fri, 09 Nov 2007 19:23:44 +0000, Soo Reams wrote:
Simetrical wrote:
On 11/9/07, Steve Sanbeg ssanbeg@ask.com wrote:
But some constructs in MW require an FSM to tokenize, not a regex. Clearly, properly tokenizing bold/italics requires complex processing on an entire paragraph of text. Even templates and links are a little complex, but should be doable by maintaining states with a stack.
FSMs accept regular languages by definition, so the set of things an FSM can recognize is precisely equal to that which can be specified by a regex. :)
In fact regexes as seen in PHP etc are more powerful than FSMs, since they can include back references and suchlike. But I presume PHP compiles regexes down to efficient FSMs if they don't include such constructs, so it probably doesn't make much difference in performance terms.
Soo Reams
IIRC, accept means that if the language is tokenized correctly, it can give a yes/no whether the input stream is valid. I don't think this helps much when trying to tokenize it to begin with.
Wouldn't regexes always be compiled to FSMs, regardless of language or constructs?