[Wikitext-l] Parser now available in svn repository

Jan Paul Posma jp.posma at gmail.com
Sat Aug 28 15:09:02 UTC 2010


This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.

I'm really looking forward to having HTML output and the PHP integration. Amazing job!

Regards, Jan Paul

On 27-Aug-2010, at 20:00, Andreas Jonsson wrote:

> I have imported the parser implementation to the repository:
> 
> http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
> 
> Dependencies:
> 
> * antlr snapshot.  Be sure to apply the patch to the C runtime.
> 
> * libtre.  Regexp library for wide character strings.  (Not actually
>   used yet.)
> 
> There is no php integration yet.
> 
> Below is a list of cases I'm aware of where the behavior differs from
> Parser.php.  (libmwparser doesn't actually output html at the moment,
> but in the below examples I've converted the traces to html in the
> obvious way for comparison.)
> 
> - Definition lists:
> 
> ;; item
> 
> Parser.php: <dl><dt></dt><dl><dt> item </dt></dl></dl>
> libmwparser: <dl><dl><dt> item</dt></dl></dl>
> 
> - Html/table attributes:
> 
> {| id='a class='b'
> | col1
> |}
> 
> Parser.php: <table class='b'><tbody><tr><td> col1 </td></tr></tbody></table>
> libmwparser: <table><tbody><tr><td> col1 </td></tr></tbody></table>
> 
> (libmwparser does not backtrack to the space character to try to find
> a valid attribute, it just considers id='a class='<junk characters> to
> be garbage altogether.)
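
The difference described above can be illustrated with a small sketch. This is not code from Parser.php or libmwparser; the function names `parse_strict` and `parse_lenient` and the regular expression are assumptions made up for illustration. The strict variant gives up on the whole attribute string at the first junk, while the lenient variant looks for any well-formed pair that starts after whitespace.

```python
import re

# A well-formed attribute: a name, '=', and a single- or double-quoted value.
# (Hypothetical grammar, chosen only for this illustration.)
PAIR = r"""([A-Za-z][\w-]*)=(?:'([^']*)'|"([^"]*)")"""


def _value(match):
    # Exactly one of the two value groups takes part in a match.
    return match.group(2) if match.group(2) is not None else match.group(3)


def parse_strict(text):
    """Read pairs left to right; reject the whole string at the first junk."""
    pair = re.compile(r"\s*" + PAIR + r"(?=\s|$)")
    attrs, pos = {}, 0
    while text[pos:].strip():
        m = pair.match(text, pos)
        if m is None:
            return {}  # no backtracking: everything counts as garbage
        attrs[m.group(1)] = _value(m)
        pos = m.end()
    return attrs


def parse_lenient(text):
    """Keep every well-formed pair that starts at whitespace; skip the rest."""
    pair = re.compile(r"(?:^|(?<=\s))" + PAIR + r"(?=\s|$)")
    return {m.group(1): _value(m) for m in pair.finditer(text)}


if __name__ == "__main__":
    broken = "id='a class='b'"
    print(parse_strict(broken))
    print(parse_lenient(broken))
```

On the broken input from the example, the strict variant returns no attributes and the lenient one recovers only `class`, which mirrors the two outputs quoted above.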
> 
> - libmwparser restricts some block element tokens to the correct
>   block contexts.
> 
> - inline formatting:
> 
> <b>'''bold'''</b>
> 
> Parser.php: <b><b>bold</b></b>
> libmwparser: <b>bold</b>
> 
> - long term formatting is applied to all inline text:
> 
> <i>text
> {|
> | col1
> |}
> text</i>
> 
> Parser.php: <p><i>text</i></p><table><tbody><tr><td> col1 </td></tr></tbody></table><p><i>text</i></p>
> libmwparser: <p><i>text</i></p><table><tbody><tr><td><i> col1</i></td></tr></tbody></table><p><i>text</i></p>
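
One way to picture the long-term formatting behaviour in the example above is a renderer that remembers which inline tags are open and closes and reopens them at every block boundary. The sketch below is only an assumption about how this could work, not libmwparser's implementation; the `render` function and the token format are hypothetical, and the `table`/`tbody`/`tr` wrappers are left out for brevity.

```python
def render(blocks):
    """Render (block_tag, tokens) pairs, carrying open inline tags across blocks.

    Each token is ("open", tag), ("close", tag) or ("text", string).
    """
    active = []  # inline tags currently open, outermost first
    out = []
    for block_tag, tokens in blocks:
        out.append(f"<{block_tag}>")
        # Reopen whatever inline formatting was still active.
        out.extend(f"<{tag}>" for tag in active)
        for kind, value in tokens:
            if kind == "open":
                active.append(value)
                out.append(f"<{value}>")
            elif kind == "close":
                # Only honour a close that matches the innermost open tag.
                if active and active[-1] == value:
                    active.pop()
                    out.append(f"</{value}>")
            else:
                out.append(value)
        # Close inline tags before the block ends, but keep them active.
        out.extend(f"</{tag}>" for tag in reversed(active))
        out.append(f"</{block_tag}>")
    return "".join(out)


if __name__ == "__main__":
    blocks = [
        ("p", [("open", "i"), ("text", "text")]),
        ("td", [("text", " col1")]),
        ("p", [("text", "text"), ("close", "i")]),
    ]
    print(render(blocks))
```

Because the inline tags are closed before each block ends, the output stays well nested, which is the property the libmwparser output above has and the Parser.php link example lacks.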
> 
> - internal links are treated as long term formatting:
> 
> [[Link|text
> {|
> | col1
> |}
> text]]
> 
> Parser.php: <p><a href="...">text</p><table><tbody><tr><td> col1 </td></tr></tbody></table><p>text</a></p>
> libmwparser: <p><a href="...">text</a></p><table><tbody><tr><td><a href="..."> col1</a></td></tr></tbody></table><p><a href="...">text</a></p>
> 
> - In general, any case that causes Parser.php to generate invalid html is
>   likely to differ in libmwparser.
> 
> Some benchmarking:
> 
> The performance isn't very impressive.
> 
> I've tried very quickly to make a comparison:
> 
> Parser.php:
> 
> * MediaWiki 1.15.0 running on a 2.2 GHz AMD Opteron 275
> 
> * I'm measuring from just before internalParse to right after
>   doBlockLevels.
> 
> libmwparser:
> 
> * 2.5 GHz Core 2 Duo
> 
> * The time for outputting the traces to /dev/null is included
> 
> 128kB of plain text:
> 
> Parser.php:  170ms
> libmwparser: 180ms
> 
> The page http://en.wikipedia.org/wiki/Wikipedia (templates not
> installed at the mediawiki test server) size 124kB
> 
> Parser.php:  720ms
> libmwparser: 190ms
> 
> As expected, Parser.php will take more time the more markup on the
> page, while libmwparser maintains a fairly constant pace.
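
To put the figures above side by side, here is a small calculation of throughput and relative speed. It uses only the sizes and timings quoted in the message; the helper names are made up for illustration. The two parsers ran on different hardware, so the ratios are rough indications at best.

```python
# Sizes in kB and timings in ms, copied from the benchmark figures above.
MEASUREMENTS = {
    "plain text": {"size_kb": 128, "Parser.php": 170, "libmwparser": 180},
    "Wikipedia page": {"size_kb": 124, "Parser.php": 720, "libmwparser": 190},
}


def throughput_kb_per_s(size_kb, millis):
    """Kilobytes parsed per second."""
    return size_kb / (millis / 1000.0)


def speedup(case):
    """How many times faster libmwparser was than Parser.php for one case."""
    row = MEASUREMENTS[case]
    return row["Parser.php"] / row["libmwparser"]


if __name__ == "__main__":
    for case, row in MEASUREMENTS.items():
        for parser in ("Parser.php", "libmwparser"):
            rate = throughput_kb_per_s(row["size_kb"], row[parser])
            print(f"{case:15s} {parser:12s} {rate:6.0f} kB/s")
        print(f"{case:15s} speedup      {speedup(case):6.2f}x")
```

On these numbers libmwparser stays between roughly 650 and 710 kB/s in both cases, while Parser.php drops from about 750 kB/s on plain text to about 170 kB/s on the markup-heavy page.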
> 
> /Andreas
> 
> 
> _______________________________________________
> Wikitext-l mailing list
> Wikitext-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitext-l



