Hi,
I have written a parser for MediaWiki syntax and have set up a test site for it here:
http://libmwparser.kreablo.se/index.php/Libmwparsertest
and the source code is available here:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
A preprocessor will take care of parser functions, magic words, comment removal, and transclusion. But as it wasn't possible to cleanly separate these functions from the existing preprocessor, some preprocessing is disabled at the test site. It should be straightforward, however, to write a new preprocessor that provides only the required functionality.
The parser is not feature complete, but the hard parts are solved. I consider "the hard parts" to be:
* parsing apostrophes
* parsing HTML mixed with wikitext
* parsing headings and links
* parsing image links
And when I say "solved" I mean producing the same or equivalent output as the original parser, as long as the original parser's behavior is well defined and its output is valid HTML.
Here is a schematic overview of the design:
+-----------------------+
|                       |  Wikitext
|  client application   +----------------------------------------+
|                       |                                         |
+-----------------------+                                         |
            ^                                                     |
            | Event stream                                        |
+-----------+-----------+        +-------------------------+     |
|                       |        |                         |     |
|    parser context     |<------>|         Parser          |     |
|                       |        |                         |     |
+-----------------------+        +-------------------------+     |
                                              ^                   |
                                              | Token stream      |
+-----------------------+        +------------+------------+     |
|                       |        |                         |     |
|    lexer context      |<------>|          Lexer          |<----+
|                       |        |                         |
+-----------------------+        +-------------------------+
The design is described in more detail in a series of posts on the wikitext-l mailing list. The most important "trick" is to make sure that the lexer never produces a spurious token: an end token for a production will not appear unless the corresponding begin token has already been produced, and the lexer maintains a block context so that it only produces tokens that make sense in the current block.
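To make the invariant concrete, here is a minimal sketch in C of how the lexer context can gate token production (the names and fields are hypothetical, not the actual libmwparser internals):

    #include <stdbool.h>

    /* Hypothetical per-production state kept in the lexer context. */
    typedef struct {
        bool italic_open;   /* a begin-italic token has been emitted */
        bool in_table;      /* block context: currently inside a table */
    } lexer_context;

    /* An end-italic token may only appear if the matching begin-italic
     * token was already produced; otherwise the apostrophes are passed
     * through as plain text. */
    static bool may_emit_end_italic(const lexer_context *ctx)
    {
        return ctx->italic_open;
    }

    /* Block-context gating: a table cell separator only makes sense
     * while a table is open. */
    static bool may_emit_table_cell(const lexer_context *ctx)
    {
        return ctx->in_table;
    }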
I have used ANTLR for generating both the parser and the lexer, as ANTLR supports semantic predicates that can be used for context-sensitive parsing. I am also using a slightly patched version of ANTLR's C runtime environment, because the lexer needs to support speculative execution in order to do context-sensitive lookahead.
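The speculative execution amounts to marking the current input position, scanning ahead under a trial assumption, and rewinding before any token is produced. Roughly like this (a simplified sketch with hypothetical names, not the patched runtime's actual API):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        const char *input;  /* UTF-8 buffer being scanned */
        size_t      pos;    /* current offset */
    } lexer_state;

    /* Trial scan: is there a closing "''" before the end of the current
     * line?  The position is restored afterwards, so the lookahead has
     * no side effects and no spurious tokens are produced. */
    static bool italic_closes_on_this_line(lexer_state *st)
    {
        size_t saved = st->pos;                   /* mark */
        bool found = false;
        while (st->input[st->pos] != '\0' && st->input[st->pos] != '\n') {
            if (st->input[st->pos] == '\'' && st->input[st->pos + 1] == '\'') {
                found = true;
                break;
            }
            st->pos++;
        }
        st->pos = saved;                          /* rewind */
        return found;
    }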
A SWIG-generated interface provides the PHP API. The parser processes the buffer of the PHP string directly and writes its output to an array of PHP strings. Only UTF-8 is supported at the moment.
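For illustration, the C-side entry point that SWIG wraps could look something like this (a hypothetical signature; see the repository for the real header):

    #include <stddef.h>

    /* The PHP string's buffer is passed in directly, without copying.
     * Output fragments are handed to a callback, which on the PHP side
     * appends them to an array of strings. */
    typedef void (*mw_output_cb)(void *client_data,
                                 const char *fragment, size_t len);

    int mw_parse(const char *wikitext, size_t len,
                 mw_output_cb emit, void *client_data);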
The performance seems to be about the same as the original parser's on plain text. But as the amount of markup increases, the original parser runs slower, while this new implementation maintains roughly the same performance regardless of input.
I think that this demonstrates the feasibility of replacing the MediaWiki parser. There is still a lot of work to do in order to turn it into a full replacement, however.
Best regards,
Andreas