[Wikitext-l] MediaWiki parser implementation

Andreas Jonsson andreas.jonsson at kreablo.se
Tue Aug 3 22:10:41 UTC 2010


I am initiating yet another attempt at writing a new parser for
MediaWiki.  It seems that more than six months have passed since the
last attempt, so it's about time. :)

Parser functions, magic words and html comments are better handled by
a preprocessor than by trying to integrate them into the parser (at
least if you want to preserve the current behavior).  So I am only
aiming at implementing something that can be plugged in after the
preprocessing.

In the wikimodel project (http://code.google.com/p/wikimodel/) we are
using a parser design that works well for wiki syntax; a front end
(implemented using an LL-parser generator) scans the text and feeds
events to a context object, which can be queried by the front end to
enable context-sensitive parsing.  The context object will in turn
feed a well-formed sequence of events to a listener that may build a
tree structure, generate xml, or produce any other format.
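
To make the division of labor concrete, here is a minimal sketch of
that design in Python.  It is only an illustration: every name in it
(PrintListener, Context, scan) is made up for this example and is not
the wikimodel API, the front end is a hand-written toy rather than
generated by an LL-parser generator, and the italic handling does not
attempt to reproduce MediaWiki's apostrophe rules.

```python
# Illustrative sketch only: all names are hypothetical, not the wikimodel API.


class PrintListener:
    """Receives a well-formed event sequence and renders it as xml-like text."""

    def __init__(self):
        self.out = []

    def begin(self, name):
        self.out.append(f"<{name}>")

    def end(self, name):
        self.out.append(f"</{name}>")

    def text(self, s):
        self.out.append(s)


class Context:
    """Tracks open elements so the listener only sees balanced events."""

    def __init__(self, listener):
        self.listener = listener
        self.stack = []

    def is_open(self, name):
        # Queried by the front end to make context-sensitive decisions.
        return name in self.stack

    def begin(self, name):
        self.stack.append(name)
        self.listener.begin(name)

    def end(self, name):
        if name not in self.stack:
            return  # stray close event: drop it
        # Close any elements nested inside `name` first, then `name` itself.
        while True:
            top = self.stack.pop()
            self.listener.end(top)
            if top == name:
                break

    def text(self, s):
        self.listener.text(s)

    def finish(self):
        # Close whatever is still open at end of input.
        while self.stack:
            self.listener.end(self.stack.pop())


def scan(source, context):
    """Toy front end: '' toggles italics, everything else is plain text."""
    for i, part in enumerate(source.split("''")):
        if i > 0:
            if context.is_open("i"):
                context.end("i")
            else:
                context.begin("i")
        if part:
            context.text(part)
    context.finish()


listener = PrintListener()
scan("a ''b'' c ''d", Context(listener))
print("".join(listener.out))  # a <i>b</i> c <i>d</i>
```

Note that the unterminated italic at the end of the input is still
closed for the listener; that repair happens in the context object, so
neither the front end nor the listener needs to know about it.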

As for parser generators, Antlr seems to be the best choice.  It has
support for semantic predicates and rather sophisticated options for
backtracking.  I'm peeking at Steve Bennet's antlr grammar
(http://www.mediawiki.org/wiki/Markup_spec/ANTLR), but I cannot really
use that one, since the parsing algorithm is fundamentally different.

There are two problems with Antlr:

1. No php back-end

    Writing a php back-end for antlr is a matter of providing a set of
    templates and porting the runtime.  It's a lot of work, but seems
    fairly straightforward.

    The parser can, of course, be written in C and be deployed as a php
    extension.  The drawback is that it will be harder to deploy, while
    the advantage is the performance.  For MediaWiki it might be worth
    maintaining both a php and a C version, though, since both speed
    and deployability are important.

2. No UTF-8 support in the C runtime in the latest release of antlr.

    In trunk it has support for various character encodings, though, so
    it will probably be there in the next release.

My implementation is just at the beginning stages, but I have
successfully reproduced the exact behavior of MediaWiki's parsing of
apostrophes, which seems to be by far the hardest part. :)

I have put it up right here, in case anyone is interested in looking at it:


Best regards,

Andreas Jonsson

