Steve Sanbeg wrote:
> IIRC, accept means that if the language is tokenized correctly, it can give a yes/no whether the input stream is valid. I don't think this helps much when trying to tokenize it to begin with.
That can only happen at a level above tokenization, i.e. parsing. To take a C example, "! &&;" is a perfectly legal set of tokens, but clearly not in the language. Also, as noted elsewhere, wikitext is basically the set of all strings, since we don't want to generate "compilation errors".
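To make the tokenizing/parsing split concrete, here is a minimal sketch of a toy lexer (in Python rather than PHP, purely for illustration) that happily tokenizes "! &&;" even though no C parser would accept that token sequence:

```python
import re

# A toy lexer covering just the tokens in the example.
# "! &&;" lexes into three perfectly legal C tokens, even though
# the sequence is not a valid program -- that rejection happens
# at the parsing level, not here.
TOKEN = re.compile(r'&&|!|;|\s+')

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise ValueError(f'bad character at position {pos}')
        if not m.group().isspace():
            tokens.append(m.group())
        pos = m.end()
    return tokens

print(tokenize('! &&;'))  # ['!', '&&', ';'] -- lexing succeeds
```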
> Wouldn't regexes always be compiled to FSMs, regardless of language or constructs?
Not FSMs, no. Perl-style regexes can do things that no FSM can: for example, since FSMs are memoryless, they cannot handle back-references. I imagine such regexes are compiled to something, but I couldn't say what. My argument was that PHP is probably smart enough to recognize regexes which don't use these extra features and compile those to FSMs, since an FSM is such an efficient implementation.
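As a concrete illustration of the back-reference point (again in Python, whose `re` module supports Perl-style back-references): the pattern below matches a word followed by a repeat of itself. The language of all such strings, {ww}, is a textbook non-regular language, so no FSM can recognize it.

```python
import re

# \1 refers back to whatever group 1 captured, so the engine must
# remember an unbounded amount of input -- something a memoryless
# finite state machine cannot do.
pattern = re.compile(r'^(\w+) \1$')

print(bool(pattern.match('hello hello')))  # True: second word repeats the first
print(bool(pattern.match('hello world')))  # False: the words differ
```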
Anyway, this is getting off topic, since the discussion was over whether an FSM is adequate to tokenize wikitext. I don't think that question has been answered yet, but if the answer is yes, then a true (Kleene-style) regex is also powerful enough. So that's fine.
Soo Reams