Steve Sanbeg wrote:
> IIRC, accept means that if the language is tokenized correctly, it can
> give a yes/no whether the input stream is valid. I don't think this
> helps much when trying to tokenize it to begin with.
That can only happen at a level above tokenization, i.e. parsing. To
take a C example, "! &&;" is a perfectly legal sequence of tokens, but
clearly not a valid program in the language. Also, as noted elsewhere,
wikitext is effectively the set of all strings, since we never want to
produce "compilation errors".
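To make the distinction concrete, here is a minimal sketch in Python
(the token set is hypothetical, not a real C lexer): every chunk of
"! &&;" lexes as a valid token, yet no parser would accept the
sequence as a statement.

```python
import re

# Hypothetical mini-lexer for a few C operators; not a real C tokenizer.
TOKEN = re.compile(r'\s*(&&|\|\||[!;])')

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise ValueError(f"lexical error at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(tokenize("! &&;"))  # ['!', '&&', ';'] -- lexically fine, syntactically nonsense
```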
> Wouldn't regexes always be compiled to FSMs, regardless of language or
> constructs?
Not FSMs, no. Perl-style regexes can do things that no FSM can do. For
example, since FSMs are memoryless, they can't include back-references.
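For instance, Python's re module (like Perl's engine) supports
back-references. The pattern below matches only a doubled word, and the
language of doubled words is a standard pumping-lemma example of a
language no FSM can recognize, because the machine would have to
remember an arbitrarily long first half.

```python
import re

# \1 must repeat exactly what group 1 captured earlier in the same match.
# A memoryless FSM has no way to "remember" the captured text.
doubled = re.compile(r'^(\w+) \1$')

print(bool(doubled.match("wiki wiki")))  # True
print(bool(doubled.match("wiki text")))  # False
```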
I imagine they are compiled to something, but I couldn't say what. My
argument was that PHP is probably smart enough to recognize regexes
which don't include these extra features, and compile them to FSMs,
since an FSM is such an efficient implementation.
Anyway, this is getting off topic, since the discussion was about
whether an FSM is adequate to tokenize wikitext. I don't think that
question has been answered yet, but if the answer is yes, then even a
true (Kleene-style) regular expression is also powerful enough. So
that's fine.
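As a sketch of what that would look like, here is a toy tokenizer for a
tiny, hypothetical subset of wikitext built only from Kleene-style
constructs (alternation, concatenation, repetition), which an FSM could
implement directly. The token names are illustrative, not MediaWiki's.

```python
import re

# Each token is a plain regular pattern; CHAR is a catch-all so that,
# in the spirit of wikitext, every input string tokenizes successfully.
TOKENS = [
    ("HEADING",    r'={2,6}'),
    ("LINK_OPEN",  r'\[\['),
    ("LINK_CLOSE", r'\]\]'),
    ("BOLD",       r"'''"),
    ("ITALIC",     r"''"),
    ("TEXT",       r"[^=\[\]']+"),
    ("CHAR",       r"."),
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKENS))

def tokenize(src):
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(src)]

print(tokenize("==Title== [[Page]]"))
```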
Soo Reams