Dear all,
I have recently subscribed to this list and I wanted to introduce myself.
As a student in the 2011 edition of Google Summer of Code, I have been working on a MediaWiki parser [1] for the Mozilla Foundation. My mentor is Erik Rose.
For this purpose, we use a PEG parser written in Python called Pijnu [2] and are implementing a grammar for it [3]. This lets us parse wikitext into an abstract syntax tree, which we will then transform into HTML or other formats.
One of the advantages of Pijnu is the simplicity and readability of its grammar definitions [3]. Our grammar is not finished yet, but what we have done so far seems very promising.
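To give an idea of the overall flow, here is a tiny self-contained Python sketch of the parse-then-transform pipeline. It is only an illustration of the idea, not Pijnu itself or our actual grammar:

    # Toy parse -> AST -> HTML pipeline for a single construct ('''bold''').
    # This is NOT Pijnu or the mediawiki-parser API, just the shape of the idea.
    import re

    def parse(wikitext):
        """Parse a tiny wikitext subset into a list-of-nodes AST."""
        ast, pos = [], 0
        for m in re.finditer(r"'''(.+?)'''", wikitext):
            if m.start() > pos:  # plain text before the bold run
                ast.append({"type": "text", "value": wikitext[pos:m.start()]})
            ast.append({"type": "bold", "value": m.group(1)})
            pos = m.end()
        if pos < len(wikitext):  # trailing plain text
            ast.append({"type": "text", "value": wikitext[pos:]})
        return ast

    def to_html(ast):
        """Transform the AST to HTML; other output formats would swap in here."""
        templates = {"text": "%(value)s", "bold": "<b>%(value)s</b>"}
        return "".join(templates[node["type"]] % node for node in ast)

    print(to_html(parse("Hello '''wiki''' world")))  # Hello <b>wiki</b> world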
Please don't hesitate to give advice or feedback, or even to test it if you wish!
Best regards
[1] https://github.com/peter17/mediawiki-parser
[2] https://github.com/peter17/pijnu
[3] https://github.com/peter17/mediawiki-parser/blob/master/mediawiki.pijnu
-- Peter Potrowl
On Wed, Jun 29, 2011 at 4:14 PM, Peter17 <peter017@gmail.com> wrote:
Neat! Your life is definitely made easier by skipping full compatibility with some of our freakier syntax oddities ;) and the result will still be very handy for various embedded-style "lite wiki" usages.
Great list of alternatives, libraries & algorithms in your notes too, though it's obviously mostly Python-oriented. It looks like you've already looked at PediaPress's mwlib library, which is also Python-based; it's definitely a bit... hairier, due to having to handle more of our funky syntax (it drives the PDF download and print-on-demand system on Wikipedia).
I'm still looking around for good parser generator tools for PHP (we've been fiddling with PEG.js in some of our JavaScript-side experiments so far, but we'll eventually need both JS and PHP implementations to cover editing tools and actual back-end rendering), so if anybody stumbles on good existing ones, give a shout, or we may have to roll our own.
Bonus points if we can eventually share the formal grammar production rules between multiple language implementations. :)
-- brion
I think we should just generate PHP and JS from whatever powers the ParserPlayground. Currently that's PEG.js, and the JS it generates actually isn't very JS-like at all; it's more like a C program, so it should be readily portable to PHP.
(Actually we might want to modify it so that it generates more concise JS code; I found I could shrink it by about 70% with some hand-applied transformations.)
That said, there are varying PEG syntaxes [1], and we may find that PEG.js isn't the best of them.
[1] don't make me say "syntactes"
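For the curious, a single production in PEG.js-style notation looks roughly like this (an illustrative sketch of the notation only, not the actual ParserPlayground grammar):

    // Match '''bold''' wikitext and build a typed node from it.
    bold
      = "'''" chars:(!"'''" c:. { return c; })* "'''"
        { return { type: "bold", text: chars.join("") }; }

Other PEG tools spell the same rule with different operator and action syntax, which is part of why sharing one formal grammar across multiple implementations is tricky.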
> I have recently subscribed to this list and I wanted to introduce myself.
> As a student in the 2011 edition of Google Summer of Code, I have been working on a MediaWiki parser [1] for the Mozilla Foundation. My mentor is Erik Rose.
...which probably means I should introduce myself as well. :-)
Hi! I'm Erik Rose, and I work on support.mozilla.com, where we keep thousands of support articles in a variant of MediaWiki syntax. Even outside Wikipedia itself (though no doubt driven by it), MW syntax has such huge mindshare that our volunteers pretty much demanded it. At the moment, we use basically a straight port of the PHP parser to Python and build painfully Byzantine layers around it to implement some custom syntax. Our summer project is to simplify this mess by building the most comprehensible, extensible MW parser available for Python:
* You'll be able to plug your own custom syntax bits into it without messing with the code.
* You can get the raw AST if you like. Or you can pass in transformation functions to customize the output of various nodes (see the sketch below).
* We'll also provide hooks so you can do whatever you want with MW "product" features like includes, templates, and such (as opposed to pure "language" features).
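To make the transformation-function idea concrete, here's a toy Python sketch of the kind of per-node hooks we have in mind; the names and the AST node shape are illustrative assumptions, not the final API:

    # Toy sketch of "pass in transformation functions to customize the
    # output of various nodes". Not the actual mediawiki-parser API.
    DEFAULTS = {
        "text": lambda node, render: node["value"],
        "bold": lambda node, render: "<b>%s</b>" % render(node["children"]),
    }

    def make_renderer(overrides=None):
        """Combine default per-node transforms with caller overrides."""
        transforms = dict(DEFAULTS, **(overrides or {}))
        def render(nodes):
            return "".join(transforms[n["type"]](n, render) for n in nodes)
        return render

    ast = [{"type": "bold", "children": [{"type": "text", "value": "hi"}]}]
    print(make_renderer()(ast))        # <b>hi</b>

    # Customize one node type without touching the parser at all:
    strong = {"bold": lambda n, r: "<strong>%s</strong>" % r(n["children"])}
    print(make_renderer(strong)(ast))  # <strong>hi</strong>

The point is that overriding the output of one node type shouldn't require changing any parser code.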
As Peter already mentioned, our project's home is https://github.com/erikrose/mediawiki-parser. Or you might look at Peter's fork. Sometimes his is more up-to-date, sometimes mine.
We have most of the productions working now. Peter's working on templates at the moment, which are probably going to involve a pre-parsing phase, and then it's on to apostrophes, which I'm hoping we can rip off other people's work for. :-)
It's great to see other folks thinking about the language. I'm sure we'll talk soon!
Erik Rose