[Wikitext-l] MediaWiki parser implementation

Mingli Yuan mingli.yuan at gmail.com
Wed Aug 4 02:21:36 UTC 2010


Hello, Andreas,

I am interested in your project.

But I cannot download the source; could you send it to me via mail
(mingli.yuan AT gmail.com)?

Thanks a lot.

Regards,
Mingli

On Wed, Aug 4, 2010 at 6:10 AM, Andreas Jonsson
<andreas.jonsson at kreablo.se>wrote:

> Hello,
>
> I am initiating yet another attempt at writing a new parser for
> MediaWiki.  It seems that more than six months have passed since the
> last attempt, so it's about time. :)
>
> Parser functions, magic words and html comments are better handled by
> a preprocessor than trying to integrate them with the parser (at least
> if you want to preserve the current behavior).  So I am only aiming at
> implementing something that can be plugged in after the preprocessing
> stages.
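As an illustration of the kind of preprocessing stage described above, here is a minimal sketch of one such pass, stripping HTML comments before the text reaches the parser. This is a hypothetical example, not MediaWiki's actual preprocessor, which also handles templates, parser functions, and magic words:

```python
import re

# Hypothetical sketch of a single preprocessing pass: removing HTML
# comments before the wikitext reaches the parser proper.  The real
# MediaWiki preprocessor does far more (templates, parser functions,
# magic words); this shows only the general shape of such a pass.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_html_comments(wikitext: str) -> str:
    """Remove <!-- ... --> comments, leaving everything else untouched."""
    return COMMENT_RE.sub('', wikitext)
```

Running the pass separately keeps the parser grammar free of comment-handling special cases.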
>
> In the wikimodel project (http://code.google.com/p/wikimodel/) we are
> using a parser design that works well for wiki syntax; a front end
> (implemented using an LL-parser generator) scans the text and feeds
> events to a context object, which can be queried by the front end to
> enable context sensitive parsing.  The context object will in turn
> feed a well formed sequence of events to a listener that may build a
> tree structure, generate xml, or any other format.
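The front-end/context/listener pipeline described above can be sketched as follows. All names here are illustrative, not the actual wikimodel API; the point is that the front end queries the context to resolve ambiguous input, and the context guarantees the listener only ever sees balanced begin/end events:

```python
# Hypothetical sketch of the event pipeline: front end -> context -> listener.
# Names are illustrative, not taken from the wikimodel project.

class Listener:
    """Receives a well-formed event stream; this one just records events."""
    def __init__(self):
        self.events = []
    def begin(self, kind):
        self.events.append(('begin', kind))
    def end(self, kind):
        self.events.append(('end', kind))
    def text(self, s):
        self.events.append(('text', s))

class Context:
    """Tracks open elements so the front end can parse context-sensitively,
    and guarantees the listener sees balanced begin/end events."""
    def __init__(self, listener):
        self.listener = listener
        self.open = []
    def is_open(self, kind):
        return kind in self.open
    def begin(self, kind):
        self.open.append(kind)
        self.listener.begin(kind)
    def end(self, kind):
        # Close any elements nested inside `kind` so the stream stays well formed.
        while self.open:
            top = self.open.pop()
            self.listener.end(top)
            if top == kind:
                break

def parse(text, ctx):
    """Toy front end: scans the text and toggles italics on '' markers."""
    i, buf = 0, []
    def flush():
        if buf:
            ctx.listener.text(''.join(buf))
            buf.clear()
    while i < len(text):
        if text.startswith("''", i):
            flush()
            # Query the context to decide whether '' opens or closes italics.
            if ctx.is_open('i'):
                ctx.end('i')
            else:
                ctx.begin('i')
            i += 2
        else:
            buf.append(text[i])
            i += 1
    flush()
```

Because the context mediates every event, a malformed input can never produce an unbalanced event stream at the listener.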
>
> As for parser generators, ANTLR seems to be the best choice.  It has
> support for semantic predicates and rather sophisticated options for
> backtracking.  I'm peeking at Steve Bennet's ANTLR grammar
> (http://www.mediawiki.org/wiki/Markup_spec/ANTLR), but I cannot really
> use that one, since the parsing algorithm is fundamentally different.
>
> There are two problems with ANTLR:
>
> 1. No PHP back-end
>
>    Writing a PHP back-end for ANTLR is a matter of providing a set of
>    templates and porting the runtime.  It's a lot of work, but seems
>    fairly straightforward.
>
>    The parser can, of course, be written in C and deployed as a PHP
>    extension.  The drawback is that it will be harder to deploy,
>    while the advantage is performance.  For MediaWiki it might be
>    worth maintaining both a PHP and a C version, though, since both
>    speed and deployability are important.
>
> 2. No UTF-8 support in the C runtime in the latest release of ANTLR.
>
>    Trunk has support for various character encodings, though, so
>    it will probably be there in the next release.
>
> My implementation is just at the beginning stages, but I have
> successfully reproduced the exact behavior of MediaWiki's parsing of
> apostrophes, which seems to be by far the hardest part. :)
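For readers unfamiliar with why apostrophes are the hardest part: MediaWiki first classifies each run of consecutive apostrophes, then balances unmatched quotes across the line. The following is a simplified sketch of only that first classification step, based on the commonly documented rules; it is not the author's implementation and omits the balancing pass entirely:

```python
def classify_apostrophe_run(n):
    """Classify a run of n consecutive apostrophes, roughly as MediaWiki's
    quote handling does as a first step.  Returns (literal_count, markup):
    literal_count apostrophes are emitted as plain text, and markup names
    the formatting the remainder signifies.  Simplified sketch; the real
    algorithm then balances unmatched quotes across the whole line."""
    if n < 2:
        return (n, None)          # a lone apostrophe is plain text
    if n == 2:
        return (0, 'italic')      # '' toggles italics
    if n == 3:
        return (0, 'bold')        # ''' toggles bold
    if n == 4:
        return (1, 'bold')        # '''' = one literal ' followed by bold
    # five or more: the extra apostrophes are literal text before bold+italic
    return (n - 5, 'bold-italic')
```

Even this table is the easy half; reproducing the exact balancing behavior for mismatched quotes is where the difficulty lies.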
>
> I put it up right here if anyone is interested in looking at it:
>
>
> http://kreablo.se:8080/x/bin/download/Gob/libmwparser/libwikimodel%2D0.1.tar.gz
>
>
> Best regards,
>
> Andreas Jonsson
>
> _______________________________________________
> Wikitext-l mailing list
> Wikitext-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>