On 11/9/07, Steve Sanbeg <ssanbeg@ask.com> wrote:
But some constructs in MW require an FSM to tokenize, not a regex. Clearly, properly tokenizing bold/italics requires complex processing on an entire paragraph of text. Even templates and links are a little complex, but should be doable by maintaining states with a stack.
FSMs accept exactly the regular languages, so the set of things an FSM can recognize is precisely the set that can be specified by a (formal) regex. :) Regardless, I take your point, and I don't know enough about the subject matter to address it. I can look at the flexbisonparse lexer and parser:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/wikilex.l?rev...
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/wikiparse.y?r...
but I can't really understand what it does, or whether it works properly, at least not without figuring out how to install the thing and run the parser tests on it.
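
As a rough illustration of the "maintaining states with a stack" idea quoted above, here is a minimal Python sketch. It is not taken from flexbisonparse or MediaWiki: the names TOKEN_RE, CLOSER_FOR and tokenize are made up for this example, and it only handles {{ }} and [[ ]], ignoring bold/italics and everything else. A regex splits the input into candidate delimiters and text, and an explicit stack tracks nesting, which is the part a plain FSM cannot do for unbounded depth.

```python
import re

# Hypothetical token pattern (an assumption for this sketch, not MediaWiki's):
# a two-character opener or closer, a run of plain text, or a stray bracket.
TOKEN_RE = re.compile(r"\{\{|\}\}|\[\[|\]\]|[^{}\[\]]+|.", re.DOTALL)

# Which closer matches which opener.
CLOSER_FOR = {"{{": "}}", "[[": "]]"}


def tokenize(text):
    """Return (tokens, stack).

    tokens is a list of (kind, value, depth) tuples, where kind is
    "open", "close" or "text". stack holds any openers left unclosed.
    """
    stack = []
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        if tok in CLOSER_FOR:
            # Opener: push it and record the new nesting depth.
            stack.append(tok)
            tokens.append(("open", tok, len(stack)))
        elif stack and tok == CLOSER_FOR[stack[-1]]:
            # Closer matching the innermost opener: record, then pop.
            tokens.append(("close", tok, len(stack)))
            stack.pop()
        else:
            # Plain text, or a closer that matches nothing: keep as text.
            tokens.append(("text", tok, len(stack)))
    return tokens, stack
```

Calling tokenize("a{{b[[c]]}}") gives seven tokens and an empty stack; a non-empty stack at the end means something was left unclosed, which a real parser would have to fall back on treating as literal text.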