Hello,
I am initiating yet another attempt at writing a new parser for MediaWiki. It seems that more than six months have passed since the last attempt, so it's about time. :)
Parser functions, magic words and html comments are better handled by a preprocessor than by trying to integrate them with the parser (at least if you want to preserve the current behavior). So I am only aiming at implementing something that can be plugged in after the preprocessing stages.
In the wikimodel project (http://code.google.com/p/wikimodel/) we are using a parser design that works well for wiki syntax: a front end (implemented using an LL-parser generator) scans the text and feeds events to a context object, which the front end can also query to enable context-sensitive parsing. The context object in turn feeds a well-formed sequence of events to a listener, which may build a tree structure, generate xml, or produce any other format.
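To make the design a bit more concrete, here is a minimal sketch of the three roles in Python. It is only an illustration: the class and method names (Context, ListListener, scan, is_open, close_all) and the toy markup, where "*" toggles bold and "_" toggles italic, are made up for this example and are neither wikimodel's API nor MediaWiki syntax.

```python
class ListListener:
    """Hypothetical listener that simply records the events it receives."""

    def __init__(self):
        self.events = []

    def begin(self, name):
        self.events.append(("begin", name))

    def end(self, name):
        self.events.append(("end", name))

    def text(self, s):
        self.events.append(("text", s))


class Context:
    """Tracks open elements and forwards only properly nested events."""

    def __init__(self, listener):
        self.listener = listener
        self.stack = []

    def is_open(self, name):
        # Queried by the front end to make context-sensitive decisions.
        return name in self.stack

    def begin(self, name):
        self.stack.append(name)
        self.listener.begin(name)

    def end(self, name):
        if name not in self.stack:
            return  # stray close event: ignore it
        # Close any inner elements first, then reopen them afterwards,
        # so the listener never sees overlapping elements.
        reopen = []
        while self.stack[-1] != name:
            inner = self.stack.pop()
            self.listener.end(inner)
            reopen.append(inner)
        self.stack.pop()
        self.listener.end(name)
        for inner in reversed(reopen):
            self.begin(inner)

    def text(self, s):
        self.listener.text(s)

    def close_all(self):
        # End of input: close whatever is still open, innermost first.
        while self.stack:
            self.listener.end(self.stack.pop())


def scan(text, context):
    """Toy front end: '*' toggles bold, '_' toggles italic (made-up markup)."""
    markers = {"*": "bold", "_": "italic"}
    buf = []

    def flush():
        if buf:
            context.text("".join(buf))
            buf.clear()

    for ch in text:
        name = markers.get(ch)
        if name is None:
            buf.append(ch)
            continue
        flush()
        # The front end asks the context whether this marker opens or closes.
        if context.is_open(name):
            context.end(name)
        else:
            context.begin(name)
    flush()
    context.close_all()
```

Feeding the overlapping input "a*b_c*d" through scan() makes the context close the italic element before the bold one and then reopen it, so the listener only ever sees properly nested begin/end pairs.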
As for parser generators, Antlr seems to be the best choice. It has support for semantic predicates and rather sophisticated options for backtracking. I'm peeking at Steve Bennet's antlr grammar (http://www.mediawiki.org/wiki/Markup_spec/ANTLR), but I cannot really use that one, since the parsing algorithm is fundamentally different.
There are two problems with Antlr:
1. No php back-end
Writing a php back-end to antlr is a matter of providing a set of templates and porting the runtime. It's a lot of work, but seems fairly straightforward.
The parser can, of course, be written in C and be deployed as a php extension. The drawback is that it will be harder to deploy; the advantage is performance. For MediaWiki it might be worth maintaining both a php and a C version, though, since both speed and deployability are important.
2. No UTF-8 support in the C runtime in the latest release of antlr.
In trunk it has support for various character encodings, though, so it will probably be there in the next release.
My implementation is just at the beginning stages, but I have successfully reproduced the exact behavior of MediaWiki's parsing of apostrophes, which seems to be by far the hardest part. :)
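To give an idea of why apostrophes are awkward, here is a simplified Python sketch of only the first step, as I understand MediaWiki's behavior: a run of two apostrophes is italic, three is bold, five is bold plus italic, a run of four is one literal apostrophe followed by bold, and anything longer than five is literal apostrophes followed by bold plus italic. The function name classify_apostrophes and the token labels are made up for this illustration. The sketch does not attempt the harder second step, where unbalanced bold and italic markers on a line are reinterpreted, and it is not the code from my implementation.

```python
import re


def classify_apostrophes(line):
    """Split a line into text and apostrophe-markup tokens (first step only)."""
    kinds = {2: "italic", 3: "bold", 5: "bold+italic"}
    tokens = []
    # The capturing group makes re.split keep the apostrophe runs.
    for part in re.split(r"('{2,})", line):
        if not part.startswith("''"):
            if part:
                tokens.append(("text", part))
            continue
        n = len(part)
        if n == 4:
            # Four apostrophes: one literal apostrophe, then bold.
            literal, n = "'", 3
        elif n > 5:
            # More than five: the surplus is literal, the rest is bold+italic.
            literal, n = "'" * (n - 5), 5
        else:
            literal = ""
        if literal:
            tokens.append(("text", literal))
        tokens.append((kinds[n], "'" * n))
    return tokens
```

Even this step is context-free only on the surface; what the markers finally mean depends on the counts over the whole line, which is where the context object becomes useful.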
I have put my implementation up here in case anyone is interested in looking at it:
http://kreablo.se:8080/x/bin/download/Gob/libmwparser/libwikimodel%2D0.1.tar...
Best regards,
Andreas Jonsson