Very interesting!  

I'm not sure his "hard parts" are really the hardest parts, but I don't know enough about the MW parser to judge.

I do hope the parser can be replaced by a C parser!

   Asaf

On Thu, Sep 23, 2010 at 2:44 PM, Manuel Schneider <manuel.schneider@wikimedia.ch> wrote:
Someone created a MediaWiki parser written in C - please see the mail below.

Greetings from Linux-Kongress in Nürnberg,

/Manuel

Sent via mobile phone.

-- Original message --
Subject: [Wikitech-l] Parser implementation for MediaWiki syntax
From: Andreas Jonsson <andreas.jonsson@kreablo.se>
Date: 23.09.2010 11:28

Hi,

I have written a parser for MediaWiki syntax and have set up a test
site for it here:

http://libmwparser.kreablo.se/index.php/Libmwparsertest

and the source code is available here:

http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser

A preprocessor will take care of parser functions, magic words,
comment removal, and transclusion.  But since it wasn't possible to
cleanly separate these functions from the existing preprocessor, some
preprocessing is disabled at the test site.  It should, however, be
straightforward to write a new preprocessor that provides only the
required functionality.
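
For illustration, a single preprocessing pass such as comment removal
could look roughly like this.  This is a simplified sketch with
made-up names, not the actual libmwparser code:

#include <stdlib.h>
#include <string.h>

/* Strip HTML comments from a wikitext buffer of len bytes.
 * Returns a newly allocated NUL-terminated string, or NULL on
 * allocation failure. */
char *strip_comments(const char *src, size_t len)
{
    char *out = malloc(len + 1);   /* output is never longer than input */
    size_t i = 0, o = 0;

    if (out == NULL)
        return NULL;
    while (i < len) {
        if (i + 4 <= len && memcmp(src + i, "<!--", 4) == 0) {
            size_t j = i + 4;
            while (j + 3 <= len && memcmp(src + j, "-->", 3) != 0)
                j++;
            /* MediaWiki drops an unterminated comment to end of input */
            i = (j + 3 <= len) ? j + 3 : len;
            continue;
        }
        out[o++] = src[i++];
    }
    out[o] = '\0';
    return out;
}

The real preprocessor additionally has to expand templates and parser
functions before the parser sees the text, which is where most of the
complexity lies.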

The parser is not feature complete, but the hard parts are solved.  I
consider "the hard parts" to be:

* parsing apostrophes
* parsing html mixed with wikitext
* parsing headings and links
* parsing image links

And when I say "solved" I mean producing the same or equivalent output
as the original parser, as long as the behavior of the original parser
is well defined and produces valid HTML.
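
To give a flavor of the first item: a run of n apostrophes has to be
split between literal text and markup.  A simplified sketch of the
classification, condensed from the real MediaWiki rules (illustrative
names only):

typedef struct {
    int literal;  /* apostrophes kept as plain text       */
    int italic;   /* 1 if the run toggles italics         */
    int bold;     /* 1 if the run toggles bold            */
} apostrophe_run;

static apostrophe_run classify_apostrophes(int n)
{
    apostrophe_run r = { 0, 0, 0 };
    if (n < 2) {
        r.literal = n;              /* a lone ' is just text      */
    } else if (n == 2) {
        r.italic = 1;               /* ''   toggles italic        */
    } else if (n == 3) {
        r.bold = 1;                 /* '''  toggles bold          */
    } else if (n == 4) {
        r.literal = 1; r.bold = 1;  /* '''' = text ' plus bold    */
    } else {
        r.literal = n - 5;          /* extras beyond 5 are text   */
        r.italic = 1; r.bold = 1;   /* ''''' toggles bold+italic  */
    }
    return r;
}

The genuinely hard part is not this table but balancing: the original
parser effectively makes a second pass over each line to reinterpret
runs so that the resulting tags nest correctly, and an event-based
parser has to reach the same decisions without rewriting its output.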

Here is a schematic overview of the design:

+-----------------------+
|                       |              Wikitext
|  client application   +---------------------------------------+
|                       |                                       |
+-----------------------+                                       |
           ^                                                    |
           | Event stream                                       |
+----------+------------+        +-------------------------+    |
|                       |        |                         |    |
|    parser context     |<------>|         Parser          |    |
|                       |        |                         |    |
+-----------------------+        +-------------------------+    |
                                              ^                 |
                                              | Token stream    |
+-----------------------+        +------------+------------+    |
|                       |        |                         |    |
|    lexer context      |<------>|         Lexer           |<---+
|                       |        |                         |
+-----------------------+        +-------------------------+
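
In C, the event stream amounts to a set of callbacks that the client
application registers.  It might look something like this; the names
are hypothetical, not the actual libmwparser interface:

#include <stddef.h>

typedef struct mw_listener {
    void *user;                                             /* client state */
    void (*text)(void *user, const char *utf8, size_t n);   /* text run     */
    void (*begin_italic)(void *user);
    void (*end_italic)(void *user);
    void (*begin_heading)(void *user, int level);           /* = h1..h6 =   */
    void (*end_heading)(void *user, int level);
} mw_listener;

A client that renders HTML would implement these callbacks; because
the parser guarantees balanced begin/end events, it can emit tags
directly without a fixup pass.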


The design is described in more detail in a series of posts on the
wikitext-l mailing list.  The most important "trick" is to make sure
that the lexer never produces a spurious token.  An end token for a
production will not appear unless the corresponding begin token has
already been produced, and the lexer maintains a block context so that
it only produces tokens that make sense in the current block.
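
Concretely, the lexer context can be thought of as a record of which
productions are currently open, consulted before any token is emitted.
A toy version (illustrative names only):

#include <stdbool.h>

enum mw_token { TOK_TEXT, TOK_ITALIC_BEGIN, TOK_ITALIC_END /* ... */ };

typedef struct {
    bool italic_open;  /* one flag (or nesting counter) per production,
                          plus the current block kind: paragraph,
                          list item, table cell, ... */
} lexer_context;

/* Called before emitting any token: an end token is only allowed if
 * its begin token is already out, and a begin token is not repeated. */
static bool may_emit(const lexer_context *ctx, enum mw_token t)
{
    switch (t) {
    case TOK_ITALIC_BEGIN: return !ctx->italic_open;
    case TOK_ITALIC_END:   return  ctx->italic_open;
    default:               return true;
    }
}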

I have used Antlr for generating both the parser and the lexer, as
Antlr supports semantic predicates that can be used for
context-sensitive parsing.  Also, I am using a slightly patched
version of Antlr's C runtime environment, because the lexer needs to
support speculative execution in order to do context-sensitive
lookahead.
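
In Antlr's C target, a semantic predicate is just a C expression
evaluated during recognition, so a context-sensitive lexer rule can be
gated on the lexer context.  For example (hypothetical names, not the
actual grammar):

#include <stdbool.h>

typedef enum { BLOCK_PARAGRAPH, BLOCK_LIST_ITEM, BLOCK_TABLE_CELL } block_kind;
typedef struct { block_kind block; } lexer_ctx;

/* A grammar rule for '|' as a table cell separator could be guarded
 * with a predicate such as { in_table(ctx) }? so that the same
 * character remains plain text outside of tables. */
static bool in_table(const lexer_ctx *ctx)
{
    return ctx->block == BLOCK_TABLE_CELL;
}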

A SWIG-generated interface provides the PHP API.  The parser processes
the buffer of the PHP string directly, and writes its output to an
array of PHP strings.  Only UTF-8 is supported at the moment.
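
The C-side entry point might have roughly this shape (again,
hypothetical names; the actual libmwparser API may differ):

#include <stddef.h>

typedef struct {
    char   **chunks;   /* output fragments, UTF-8 encoded */
    size_t   nchunks;  /* number of fragments             */
} mw_output;

/* Parse len bytes of UTF-8 wikitext from buf; returns 0 on success. */
int mw_parse_utf8(const char *buf, size_t len, mw_output *out);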

The performance seems to be about the same as the original parser's on
plain text.  But as the amount of markup increases, the original
parser slows down, while this new implementation maintains roughly the
same performance regardless of input.

I think that this demonstrates the feasibility of replacing the
MediaWiki parser.  There is still a lot of work to do in order to turn
it into a full replacement, however.

Best regards,

Andreas





--
Asaf Bartov <asaf.bartov@gmail.com>