[Wikitext-l] Parser now available in svn repository

Andreas Jonsson andreas.jonsson at kreablo.se
Fri Aug 27 18:00:02 UTC 2010


I have imported the parser implementation to the repository:

http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser

Dependencies:

* antlr snapshot.  Be sure to apply the patch to the C runtime.

* libtre.  Regexp library for wide character strings.  (Not actually
   used yet.)

There is no php integration yet.

Below is a list of cases I'm aware of where the behavior differs from
Parser.php.  (libmwparser doesn't actually output html at the moment,
but in the examples below I've converted the traces to html in the
obvious way for comparison.)

- Definition lists:

;; item

Parser.php: <dl><dt></dt><dl><dt> item </dt></dl></dl>
libmwparser: <dl><dl><dt> item</dt></dl></dl>

- Html/table attributes:

{| id='a class='b'
| col1
|}

Parser.php: <table class='b'><tbody><tr><td> col1 </td></tr></tbody></table>
libmwparser: <table><tbody><tr><td> col1 </td></tr></tbody></table>

(libmwparser does not backtrack to the space character to try to find
a valid attribute; it just considers id='a class='<junk characters> to
be garbage altogether.)
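
To illustrate the difference, here is a minimal Python sketch.  It is
not code from Parser.php or libmwparser: the pattern ATTR is a
simplified assumption, and lenient_attributes and strict_attributes
are hypothetical names.  The first searches the whole string for
anything that looks like a valid attribute, the second reads
attributes left to right and discards everything on the first
malformed one.

```python
import re

# Simplified, assumed attribute pattern: name='value' or name="value",
# preceded by start-of-string or whitespace, followed by whitespace or end.
ATTR = re.compile(
    r"""(?:^|\s)([A-Za-z0-9]+)\s*=\s*(?:"([^<"]*)"|'([^<']*)')(?=\s|$)"""
)


def _value(m):
    # Exactly one of the two quoted-value groups participates in a match.
    return m.group(2) if m.group(2) is not None else m.group(3)


def lenient_attributes(text):
    """Search anywhere in the text for well-formed attributes."""
    return {m.group(1): _value(m) for m in ATTR.finditer(text)}


def strict_attributes(text):
    """Consume attributes left to right; give up on the first error."""
    text = text.strip()
    result = {}
    pos = 0
    while pos < len(text):
        m = ATTR.match(text, pos)
        if m is None:
            return {}  # treat the whole attribute string as garbage
        result[m.group(1)] = _value(m)
        pos = m.end()
    return result


print(lenient_attributes("id='a class='b'"))  # {'class': 'b'}
print(strict_attributes("id='a class='b'"))   # {}
```

Both functions agree on well-formed input such as id='a' class='b';
they only differ once a malformed attribute is encountered.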

- libmwparser restricts some block element tokens to the correct
   block contexts.

- inline formatting:

<b>'''bold'''</b>

Parser.php: <b><b>bold</b></b>
libmwparser: <b>bold</b>

- long term formatting is applied to all inline text:

<i>text
{|
| col1
|}
text</i>

Parser.php: <p><i>text</i></p><table><tbody><tr><td> col1 </td></tr></tbody></table><p><i>text</i></p>
libmwparser: <p><i>text</i></p><table><tbody><tr><td><i> col1</i></td></tr></tbody></table><p><i>text</i></p>

- internal links are treated as long term formatting:

[[Link|text
{|
| col1
|}
text]]

Parser.php: <p><a href="...">text</p><table><tbody><tr><td> col1 </td></tr></tbody></table><p>text</a></p>
libmwparser: <p><a href="..."><text</a></p><table><tbody><tr><td><a href="..."> col1</a></td></tr></tbody></table><p><a href="...">text</a></p>

- In general, any case that causes Parser.php to generate invalid html is
   likely to differ in libmwparser.
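
The "long term formatting" cases above can be sketched roughly as
follows.  This is a minimal Python illustration, not the libmwparser
implementation: the token format and the names wrap and render are
assumptions made up for the example.  The idea is that the set of open
inline formats is carried across block boundaries and re-applied to
every run of inline text, so the output stays well nested.

```python
def wrap(text, active):
    """Wrap text in every currently active formatting tag."""
    for tag in reversed(active):
        text = f"<{tag}>{text}</{tag}>"
    return text


def render(blocks):
    """blocks: list of (open_html, tokens, close_html).

    tokens are ("open", tag), ("close", tag) or ("text", string).
    """
    active = []  # formatting still open; survives block boundaries
    out = []
    for open_html, tokens, close_html in blocks:
        out.append(open_html)
        for kind, value in tokens:
            if kind == "open":
                active.append(value)
            elif kind == "close":
                if value in active:
                    # drop the most recently opened matching tag
                    idx = len(active) - 1 - active[::-1].index(value)
                    del active[idx]
            else:
                out.append(wrap(value, active))
        out.append(close_html)
    return "".join(out)


blocks = [
    ("<p>", [("open", "i"), ("text", "text")], "</p>"),
    ("<table><tbody><tr><td>", [("text", " col1")],
     "</td></tr></tbody></table>"),
    ("<p>", [("text", "text"), ("close", "i")], "</p>"),
]
print(render(blocks))
```

For the <i> example this produces the same html as the libmwparser
line shown above, with the table cell content wrapped in <i> as well.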

Some benchmarking:

The performance isn't very impressive.

I've made a very quick comparison:

Parser.php:

* MediaWiki 1.15.0 running on a 2.2 GHz AMD Opteron 275

* I'm measuring from just before internalParse to right after
   doBlockLevels.

libmwparser:

* 2.5 GHz Core 2 Duo

* The time for outputting the traces to /dev/null is included

128kB of plain text:

Parser.php:  170ms
libmwparser: 180ms

The page http://en.wikipedia.org/wiki/Wikipedia (templates not
installed on the MediaWiki test server), size 124kB:

Parser.php:  720ms
libmwparser: 190ms

As expected, Parser.php takes more time the more markup there is on
the page, while libmwparser maintains a fairly constant pace.
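
To make the "constant pace" concrete, the figures above can be
converted to throughput.  The following is just a small Python
calculation over the numbers already given; the function name
throughput_kb_per_s is made up for the example.  Keep in mind that the
two parsers were measured on different machines, so the comparison is
rough.

```python
def throughput_kb_per_s(size_kb, elapsed_ms):
    """Convert a page size in kB and a time in ms to kB per second."""
    return size_kb / (elapsed_ms / 1000.0)


# (page, parser, size in kB, time in ms), taken from the measurements above
measurements = [
    ("plain text", "Parser.php", 128, 170),
    ("plain text", "libmwparser", 128, 180),
    ("Wikipedia page", "Parser.php", 124, 720),
    ("Wikipedia page", "libmwparser", 124, 190),
]

for page, parser, size_kb, ms in measurements:
    rate = throughput_kb_per_s(size_kb, ms)
    print(f"{page:15} {parser:12} {rate:6.0f} kB/s")
```

libmwparser stays at roughly 650 to 710 kB/s on both inputs, while
Parser.php drops from about 750 kB/s on plain text to about 170 kB/s
on the markup-heavy page.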

/Andreas



