Hello,
I am initiating yet another attempt at writing a new parser for MediaWiki. It seems that more than six months have passed since the last attempt, so it's about time. :)
Parser functions, magic words and HTML comments are better handled by a preprocessor than by trying to integrate them with the parser (at least if you want to preserve the current behavior). So I am only aiming at implementing something that can be plugged in after the preprocessing stages.
In the wikimodel project (http://code.google.com/p/wikimodel/) we are using a parser design that works well for wiki syntax: a front end (implemented using an LL parser generator) scans the text and feeds events to a context object, which can be queried by the front end to enable context-sensitive parsing. The context object will in turn feed a well-formed sequence of events to a listener that may build a tree structure, generate XML, or produce any other format.
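To make the division of labor concrete, here is a minimal sketch in C (all names are invented for illustration; wikimodel itself is written in Java, so this only mirrors the design):

/* Front end / context / listener split, sketched. */

typedef struct listener {
    /* The listener only ever sees a well-formed event sequence. */
    void (*begin_bold)(struct listener *self);
    void (*end_bold)(struct listener *self);
    void (*text)(struct listener *self, const char *s, int len);
} listener;

typedef struct context {
    listener *sink;
    int in_bold;   /* state that the front end can query */
} context;

/* Queried by the front end to enable context-sensitive parsing. */
static int context_in_bold(const context *c)
{
    return c->in_bold;
}

/* Fed events by the front end; guarantees that the listener sees
 * balanced begin/end events. */
static void context_toggle_bold(context *c)
{
    if (c->in_bold) {
        c->sink->end_bold(c->sink);
    } else {
        c->sink->begin_bold(c->sink);
    }
    c->in_bold = !c->in_bold;
}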
As for parser generators, Antlr seems to be the best choice. It has support for semantic predicates and rather sophisticated options for backtracking. I'm peeking at Steve Bennett's Antlr grammar (http://www.mediawiki.org/wiki/Markup_spec/ANTLR), but I cannot really use that one, since the parsing algorithm is fundamentally different.
There are two problems with Antlr:
1. No PHP back-end
Writing a PHP back-end to Antlr is a matter of providing a set of templates and porting the runtime. It's a lot of work, but it seems fairly straightforward.
The parser can, of course, be written in C and deployed as a PHP extension (see the sketch after this list). The drawback is that it becomes harder to deploy, while the advantage is performance. For MediaWiki it might be worth maintaining both a PHP and a C version, though, since both speed and ease of deployment are important.
2. No UTF-8 support in the C runtime in the latest release of Antlr.
In trunk there is support for various character encodings, though, so it will probably be there in the next release.
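To illustrate the PHP extension route from point 1: a minimal skeleton could look roughly like the following, using the PHP 5 extension API. Note that mwparser_run is a hypothetical name for the parser's entry point, not an actual libmwparser function:

#include "php.h"
#include <stdlib.h>

/* Hypothetical entry point into the C parser; the real API will
 * differ.  Assumed to return a malloc'd string. */
extern char *mwparser_run(const char *input, int len);

PHP_FUNCTION(mwparser_parse)
{
    char *text;
    int text_len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s",
                              &text, &text_len) == FAILURE) {
        RETURN_NULL();
    }

    char *out = mwparser_run(text, text_len);
    RETVAL_STRING(out, 1);   /* copy into a PHP-managed string */
    free(out);
}

static zend_function_entry mwparser_functions[] = {
    PHP_FE(mwparser_parse, NULL)
    {NULL, NULL, NULL}
};

zend_module_entry mwparser_module_entry = {
    STANDARD_MODULE_HEADER,
    "mwparser",
    mwparser_functions,
    NULL, NULL, NULL, NULL, NULL,  /* MINIT, MSHUTDOWN, RINIT, RSHUTDOWN, MINFO */
    "0.1",
    STANDARD_MODULE_PROPERTIES
};

ZEND_GET_MODULE(mwparser)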
My implementation is only at the beginning stages, but I have successfully reproduced the exact behavior of MediaWiki's parsing of apostrophes, which seems to be by far the hardest part. :)
I put it up right here if anyone is interested in looking at it:
http://kreablo.se:8080/x/bin/download/Gob/libmwparser/libwikimodel%2D0.1.tar...
Best regards,
Andreas Jonsson
Hello, Andreas,
I am interested in your project, but I cannot download the source. Could you send it to me via mail (mingli.yuan AT gmail.com)?
Thanks a lot.
Regards, Mingli
On Wed, Aug 4, 2010 at 6:10 AM, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
Hello,
I am initiating yet another attempt at writing a new parser for MediaWiki. [SNIP]
"Andreas Jonsson" andreas.jonsson@kreablo.se wrote in message news:4C5893E1.5060609@kreablo.se...
Hello,
I am initiating yet another attempt at writing a new parser for MediaWiki. It seems that more than six months have passed since the last attempt, so it's about time. :)
[SNIP]
I put it up right here if anyone is interested in looking at it:
http://kreablo.se:8080/x/bin/download/Gob/libmwparser/libwikimodel%2D0.1.tar...
Hi Andreas
It would be great if this could go somewhere on mediawiki.org. Please put the code into a dedicated page, using the appropriate language tags (for syntax colouring) rather than uploading the zip file, and link it in to the grammar navigation template [1].
Cheers, and good luck!
- Mark Clements (HappyDog)
[1] http://www.mediawiki.org/wiki/Template:Grammar_nav
Hi Mark,
The implementation consists of two components: the grammar (an Antlr file) and a context class (several C files). The package also contains build scripts and unit tests, so it may not be well suited to a wiki page. It would be more appropriate to put it in a source code repository, which I intend to do eventually. Maybe Wikimedia's Subversion repository can be used for this?
Best regards,
Andreas

2010-08-04 18:15, Mark Clements (HappyDog) wrote:
It would be great if this could go somewhere on mediawiki.org. [SNIP]
Andreas Jonsson wrote:
[SNIP]
It would be more appropriate to put it in a source code repository, which I intend to do eventually. Maybe Wikimedia's Subversion repository can be used for this?
Indeed it could. You shouldn't have any problems getting access there. See http://www.mediawiki.org/wiki/Commit_access_requests
On Tue, Aug 3, 2010 at 6:10 PM, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
Hello,
I am initiating yet another attempt at writing a new parser for MediaWiki. [SNIP]
/me grabs popcorn
-Chad
On Thu, Aug 5, 2010 at 1:17 PM, Chad innocentkiller@gmail.com wrote:
[SNIP]
/me grabs popcorn
/me grabs tear tissues...
Magnus Manske wrote:
[SNIP]
/me grabs tear tissues...
/me grabs pumping lemma
2010-08-05 14:45, Daniel Kinzler wrote:
[SNIP]
/me grabs pumping lemma
C'mon, don't be so pessimistic. :) Since I'm writing a context-aware parser, I don't see why you bring up the pumping lemma. Unless you have formulated one for the class of languages accepted by the class of parsers that my algorithm represents.
/Andreas
Andreas Jonsson wrote:
/me grabs pumping lemma
C'mon, don't be so pessimistic. :) [SNIP]
Sorry, I didn't really look at the algorithm you propose. I just know that one reason previous attempts failed is that wikitext isn't context free - and thus can't be parsed using an LL(*) grammar as used by Antlr. Didn't you say you'd use Antlr? Perhaps a packrat parser would work...
-- daniel
2010-08-06 10:57, Daniel Kinzler wrote:
[SNIP]
I just know that one reason previous attempts failed is that wikitext isn't context free - and thus can't be parsed using an LL(*) grammar as used by Antlr. Didn't you say you'd use Antlr? Perhaps a packrat parser would work...
I'm using an Antlr-generated front end that interacts with a context object which provides the parser with hints. I'm relying on the token stream being stable, so anything that might introduce new tokens (comments, inclusions, magic words, etc.) has to be taken care of by a preprocessor.
As an example, to parse the apostrophe jungles, a prescan of the token stream is performed to collect all apostrophe sequences. Then the token stream is rewound, the context computes hints, and the actual parsing takes place. So the grammar productions in Antlr look like this:
apostrophes:
    (apostrophe)+
    { if (X->inlinePrescan) { X->onApostrophesPrescan(X, $text); } }
    ;

bold:   {X->takeBold}?=>   (begin_bold | end_bold);
italic: {X->takeItalic}?=> (begin_italic | end_italic);

apostrophe:
      {X->inlinePrescan}?=>  A
    | {X->takeApostrophe}?=> A {X->onConsumedApostrophes(X);}
    ;

begin_bold:   {!X->inBold}?=>   A A A {X->beginBold(X);}   ;
end_bold:     { X->inBold}?=>   A A A {X->endBold(X);}     ;
begin_italic: {!X->inItalic}?=> A A   {X->beginItalic(X);} ;
end_italic:   { X->inItalic}?=> A A   {X->endItalic(X);}   ;
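(For orientation: X above is the parser's context object. Reconstructed from the fields used in these productions and in the method at the end of this mail, its type looks roughly like the sketch below. This is an approximation for readers, not a copy of the actual header:)

#include <stdbool.h>
#include <antlr3.h>

typedef struct MWPARSERCONTEXT_struct MWPARSERCONTEXT;

struct MWPARSERCONTEXT_struct {
    /* Hints queried by the grammar's semantic predicates. */
    bool inlinePrescan;
    bool takeApostrophe;
    bool takeItalic;
    bool takeBold;
    bool inBold;
    bool inItalic;

    /* Callbacks invoked from the grammar actions. */
    void (*onApostrophesPrescan)(MWPARSERCONTEXT *ctx, pANTLR3_STRING text);
    void (*onConsumedApostrophes)(MWPARSERCONTEXT *ctx);
    void (*beginBold)(MWPARSERCONTEXT *ctx);
    void (*endBold)(MWPARSERCONTEXT *ctx);
    void (*beginItalic)(MWPARSERCONTEXT *ctx);
    void (*endItalic)(MWPARSERCONTEXT *ctx);

    /* State for the apostrophe prescan (see endInlinePrescan below). */
    pANTLR3_VECTOR apostropheSequences;
    pANTLR3_VECTOR parseInlineInstruction;
    int currentInlineInstruction;
};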
Now, computing the hints to reproduce MediaWiki's apostrophe parsing is ridiculously complex (see the method below), but at least it is a separate concern. Also, I don't think there is anything more complex in the syntax, so I feel confident about the general idea of the Antlr/context combination.
Best regards,
Andreas
static void endInlinePrescan(MWPARSERCONTEXT *context)
{
    context->inlinePrescan = false;

    if (context->apostropheSequences != NULL) {
        pANTLR3_VECTOR v = context->apostropheSequences;
        int i;
        int two = 0;
        int three = 0;
        APOSTROPHE_SEQUENCE *victim = NULL;

        /*
         * Pass 1: count the sequences that will toggle italic (length 2)
         * and bold (length 3 or 4); length 5 and up toggles both.  Also
         * remember a candidate sequence to demote ("victim").
         */
        for (i = 0; i < v->count; i++) {
            APOSTROPHE_SEQUENCE *s = v->get(v, i);
            switch (s->sequence->len) {
            case 1:
                break;
            case 2:
                two++;
                break;
            case 3:
            case 4:
                three++;
                victim = checkPotentialVictim(victim, s);
                break;
            default:
                two++;
                three++;
                break;
            }
        }

        /* The victim is only needed if both toggle counts are odd. */
        if (two % 2 != 1 || three % 2 != 1) {
            victim = NULL;
        }

        pANTLR3_VECTOR pi = context->parseInlineInstruction;
        pi->clear(pi);

        int openingFive = -1;
        int italic = 0;
        int bold = 0;

#define APOSTROPHE pi->add(pi, &AD_APOSTROPHE_CONST, NULL)

#define ITALIC                                                          \
        do {                                                            \
            italic++;                                                   \
            pi->add(pi, &AD_ITALIC_CONST, NULL);                        \
            if (openingFive != -1) {                                    \
                /*                                                      \
                 * Swap bold and italic.                                \
                 */                                                     \
                assert(*(APOSTROPHE_DIRECTION*)pi->get(pi, openingFive) == AD_ITALIC && \
                       *(APOSTROPHE_DIRECTION*)pi->get(pi, openingFive + 1) == AD_BOLD); \
                pi->swap(pi, openingFive, openingFive + 1);             \
                openingFive = -1;                                       \
            }                                                           \
        } while (0)

#define BOLD                                                            \
        do {                                                            \
            bold++;                                                     \
            pi->add(pi, &AD_BOLD_CONST, NULL);                          \
            openingFive = -1;                                           \
        } while (0)

        /*
         * Pass 2: emit one instruction per apostrophe, deciding for each
         * sequence whether it contributes literal apostrophes, an italic
         * toggle, or a bold toggle.
         */
        for (i = 0; i < v->count; i++) {
            APOSTROPHE_SEQUENCE *s = v->get(v, i);
            switch (s->sequence->len) {
            case 1:
                APOSTROPHE;
                break;
            case 2:
                ITALIC;
                break;
            case 3:
                if (s == victim) {
                    APOSTROPHE;
                    ITALIC;
                } else {
                    BOLD;
                }
                break;
            case 4:
                if (s == victim) {
                    APOSTROPHE;
                    APOSTROPHE;
                    ITALIC;
                } else {
                    APOSTROPHE;
                    BOLD;
                }
                break;
            default: {
                int j;
                for (j = 0; j < s->sequence->len - 5; j++) {
                    APOSTROPHE;
                }
                if (italic % 2 == 0 && bold % 2 == 0) {
                    /*
                     * Five (or more) apostrophes opening up new
                     * formattings.  We may need to swap the order
                     * later, so we save the index.
                     */
                    openingFive = pi->count;
                }
                pi->add(pi, &AD_ITALIC_CONST, NULL);
                italic++;
                pi->add(pi, &AD_BOLD_CONST, NULL);
                bold++;
                break;
            }
            }
        }

        context->currentInlineInstruction = 0;

        /* Prime the hints for the first instruction. */
        APOSTROPHE_DIRECTION ad = *(APOSTROPHE_DIRECTION*)pi->get(pi, 0);

        context->takeApostrophe = ad == AD_APOSTROPHE;
        context->takeItalic = ad == AD_ITALIC;
        context->takeBold = ad == AD_BOLD;

        context->apostropheSequences->free(context->apostropheSequences);
        context->apostropheSequences = NULL;
    }
}
Andreas Jonsson wrote:
I'm using an Antlr-generated front end that interacts with a context object which provides the parser with hints. [SNIP]
That actually sounds pretty good!
Now, computing the hints to reproduce MediaWiki's apostrophe parsing is ridiculously complex, but at least it is a separate concern. [SNIP]
Nested tables can get pretty nasty, too. And mixed lists with indentations like *#::* that may or may not match the previous line's indentation might also cause trouble, I think.
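For instance (an invented fragment), each line's prefix has to be compared with the previous line's to decide which lists continue and which open or close:

* a bullet item
*# an ordered list nested inside it
*#:: a definition-style indent inside that
*#::* a bullet that continues the whole prefix
*#:* a line whose prefix only partially matches the one above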
Best of luck, Daniel