Hi,
I have written a parser for MediaWiki syntax and have set up a test site for it here:
http://libmwparser.kreablo.se/index.php/Libmwparsertest
and the source code is available here:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
A preprocessor will take care of parser functions, magic words, comment removal, and transclusion. But as it wasn't possible to cleanly separate these functions from the existing preprocessor, some preprocessing is disabled at the test site. It should be straightforward, however, to write a new preprocessor that provides only the required functionality.
The parser is not feature complete, but the hard parts are solved. I consider "the hard parts" to be:
* parsing apostrophes
* parsing HTML mixed with wikitext
* parsing headings and links
* parsing image links
And when I say "solved" I mean producing the same or equivalent output as the original parser, as long as the original parser's behavior is well defined and its output is valid HTML.
Here is a schematic overview of the design:
+-----------------------+
|                       |  Wikitext
|  client application   +----------------------------------------+
|                       |                                         |
+-----------------------+                                         |
            ^                                                     |
            | Event stream                                        |
+-----------+-----------+        +-------------------------+     |
|                       |        |                         |     |
|    parser context     |<------>|         Parser          |     |
|                       |        |                         |     |
+-----------------------+        +-------------------------+     |
                                              ^                   |
                                              | Token stream      |
+-----------------------+        +------------+------------+     |
|                       |        |                         |     |
|    lexer context      |<------>|          Lexer          |<----+
|                       |        |                         |
+-----------------------+        +-------------------------+
The design is described in more detail in a series of posts on the wikitext-l mailing list. The most important "trick" is to make sure that the lexer never produces a spurious token: an end token for a production will not appear unless the corresponding begin token has already been produced, and the lexer maintains a block context so that it only produces tokens that make sense in the current block.
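To make the invariant concrete, here is a minimal sketch in C of how the lexer context can gate token production (the names and fields are hypothetical, not the actual libmwparser internals):

    #include <stdbool.h>

    /* Hypothetical per-production state kept in the lexer context. */
    typedef struct {
        bool italic_open;   /* a begin-italic token has been emitted */
        bool in_table;      /* block context: currently inside a table */
    } lexer_context;

    /* An end-italic token may only appear if the matching begin-italic
     * token was already produced; otherwise the apostrophes are passed
     * through as plain text. */
    static bool may_emit_end_italic(const lexer_context *ctx)
    {
        return ctx->italic_open;
    }

    /* Block-context gating: a table cell separator only makes sense
     * while a table is open. */
    static bool may_emit_table_cell(const lexer_context *ctx)
    {
        return ctx->in_table;
    }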
I have used ANTLR for generating both the parser and the lexer, as ANTLR supports semantic predicates that can be used for context-sensitive parsing. I am also using a slightly patched version of ANTLR's C runtime environment, because the lexer needs to support speculative execution in order to do context-sensitive lookahead.
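The speculative execution amounts to marking the current input position, scanning ahead under a trial assumption, and rewinding before any token is produced. Roughly like this (a simplified sketch with hypothetical names, not the patched runtime's actual API):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        const char *input;  /* UTF-8 buffer being scanned */
        size_t      pos;    /* current offset */
    } lexer_state;

    /* Trial scan: is there a closing "''" before the end of the current
     * line?  The position is restored afterwards, so the lookahead has
     * no side effects and no spurious tokens are produced. */
    static bool italic_closes_on_this_line(lexer_state *st)
    {
        size_t saved = st->pos;                   /* mark */
        bool found = false;
        while (st->input[st->pos] != '\0' && st->input[st->pos] != '\n') {
            if (st->input[st->pos] == '\'' && st->input[st->pos + 1] == '\'') {
                found = true;
                break;
            }
            st->pos++;
        }
        st->pos = saved;                          /* rewind */
        return found;
    }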
A SWIG-generated interface provides the PHP API. The parser processes the buffer of the PHP string directly and writes its output to an array of PHP strings. Only UTF-8 is supported at the moment.
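For illustration, the C-side entry point that SWIG wraps could look something like this (a hypothetical signature; see the repository for the real header):

    #include <stddef.h>

    /* The PHP string's buffer is passed in directly, without copying.
     * Output fragments are handed to a callback, which on the PHP side
     * appends them to an array of strings. */
    typedef void (*mw_output_cb)(void *client_data,
                                 const char *fragment, size_t len);

    int mw_parse(const char *wikitext, size_t len,
                 mw_output_cb emit, void *client_data);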
The performance seems to be about the same as the original parser's on plain text. But as the amount of markup increases, the original parser runs slower, while this new implementation maintains roughly the same performance regardless of input.
I think that this demonstrates the feasibility of replacing the MediaWiki parser. There is still a lot of work to do in order to turn it into a full replacement, however.
Best regards,
Andreas