I have imported the parser implementation to the repository:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
Dependencies:
* antlr snapshot. Be sure to apply the patch to the C runtime.
* libtre. Regexp library for wide character strings. (Not actually used yet.)
There is no php integration yet.
Below is a list of cases I'm aware of where the behavior differs from Parser.php. (libmwparser doesn't actually output html at the moment, but in the examples below I've converted the traces to html in the obvious way for comparison.)
- Definition lists:
;; item
Parser.php: <dl><dt></dt><dl><dt> item </dt></dl></dl> libmwparser: <dl><dl><dt> item</dt></dl></dl>
- Html/table attributes:
{| id='a class='b' | col1 |}
Parser.php: <table class='b'><tbody><tr><td> col1 </td></tr></tbody></table> libmwparser: <table><tbody><tr><td> col1 </td></tr></tbody></table>
(libmwparser does not backtrack to the space character to try to find a valid attribute; it just considers id='a class='<junk characters> to be garbage altogether.)
- libmwparser restricts some block element tokens to the correct block contexts.
- inline formatting:
<b>'''bold'''</b>
Parser.php: <b><b>bold</b></b> libmwparser: <b>bold</b>
- long-term formatting is applied to all inline text:
<i>text {| | col1 |} text</i>
Parser.php: <p><i>text</i></p><table><tbody><tr><td> col1 </td></tr></tbody></table><p><i>text</i></p> libmwparser: <p><i>text</i></p><table><tbody><tr><td><i> col1</i></td></tr></tbody></table><p><i>text</i></p>
- internal links are treated as long-term formatting:
[[Link|text {| | col1 |} text]]
Parser.php: <p><a href="...">text</p><table><tbody><tr><td> col1 </td></tr></tbody></table><p>text</a></p> libmwparser: <p><a href="...">text</a></p><table><tbody><tr><td><a href="..."> col1</a></td></tr></tbody></table><p><a href="...">text</a></p>
- In general, any case that causes Parser.php to generate invalid html is likely to differ in libmwparser.
Some benchmarking:
The performance isn't very impressive.
I've tried very quickly to make a comparison:
Parser.php:
* Mediawiki 1.15.0 running on a 2.2 GHz AMD Opteron 275
* I'm measuring from just before internalParse to right after doBlockLevels.
libmwparser:
* 2.5 GHz Core 2 Duo
* The time for outputting the traces to /dev/null is included
128kB of plain text:
Parser.php: 170ms libmwparser: 180ms
The page http://en.wikipedia.org/wiki/Wikipedia (templates not installed at the mediawiki test server), size 124 kB:
Parser.php: 720ms libmwparser: 190ms
As expected, Parser.php takes more time the more markup there is on the page, while libmwparser maintains a fairly constant pace.
/Andreas
On 27 August 2010 19:00, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
The performance isn't very impressive. I've tried very quickly to make a comparison:
It's comparable, and I'm amazed!
:-O
So ... is it fit for people to drop into place and experiment with?
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
- d.
On 27 August 2010 22:11, David Gerard dgerard@gmail.com wrote:
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
http://davidgerard.co.uk/notes/2010/08/22/staring-into-the-eye-of-cthulhu/
- d.
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
Regards, Mingli
On Sat, Aug 28, 2010 at 5:16 AM, David Gerard dgerard@gmail.com wrote:
On 27 August 2010 22:11, David Gerard dgerard@gmail.com wrote:
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
http://davidgerard.co.uk/notes/2010/08/22/staring-into-the-eye-of-cthulhu/
- d.
Mingli Yuan wrote:
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
Regards, Mingli
svn doesn't have git's inconvenience of forcing you to check out all or nothing. To get only libmwparser just run: svn checkout http://svn.wikimedia.org/viewvc/mediawiki/trunk/parsers/libmwparser/
Thanks, Platonides.
I know the svn commands to check out or export a directory. But I want to emphasize that the collaboration models of svn and git are different.
For example, if I want to add some test cases, I need someone to grant me svn access. With github, I can push my contribution back upstream very easily, and the original author can choose what to pull back.
The above is just my advice; if Andreas chooses svn, I will follow his changes on svn as well. Let's drop the irrelevant git/svn discussion and get back to the main topic.
Regards, Mingli
On Sat, Aug 28, 2010 at 7:27 PM, Platonides platonides@gmail.com wrote:
Mingli Yuan wrote:
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
Regards, Mingli
svn doesn't have git's inconvenience of forcing you to check out all or nothing. To get only libmwparser just run: svn checkout http://svn.wikimedia.org/viewvc/mediawiki/trunk/parsers/libmwparser/
2010-08-28 05:01, Mingli Yuan wrote:
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
I too like git, and I often use it locally even if the central repository is using a different system; checking out a source tree from subversion right into a git working directory works pretty well for me.
I don't really have a strong preference on which system to use. It just seemed appropriate to put it on mediawiki.org.
Github seems very nice, though, so maybe I'll reconsider. But free software also means not depending on any single person, so if you or someone else would like to take the initiative and set up a repository with a project page and issue tracker, I would be happy to support that.
Best regards,
Andreas
2010-08-27 23:11, David Gerard wrote:
On 27 August 2010 19:00, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
The performance isn't very impressive. I've tried very quickly to make a comparison:
It's comparable, and I'm amazed!
:-O
I would have thought that a "real" parser would clearly be ahead on any input, but maybe I'm underestimating the original parser. Anyway, there should be room for optimization. Profiling shows that about 64% of the time is spent in the parser, while 25% of the time is spent in the lexer. To me it seems that the parser has an easier task than the lexer, and should therefore be faster.
So ... is it fit for people to drop into place and experiment with?
No, not yet. Some features are missing and there is no php-integration yet. But unless I've missed something, there is nothing that I don't immediately know how to implement.
/Andreas
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
- d.
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
I'm really looking forward to having HTML output and the PHP integration. Amazing job!
Regards, Jan Paul
On 28 August 2010 16:09, Jan Paul Posma jp.posma@gmail.com wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
*May* be. You won't know until you test ;-)
- d.
On 28-Aug-2010, at 17:17, David Gerard wrote:
On 28 August 2010 16:09, Jan Paul Posma jp.posma@gmail.com wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
*May* be. You won't know until you test ;-)
Sure, sure, but Andreas' benchmarks are hopeful :-D
Regards, Jan Paul
Jan Paul Posma wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
I'm really looking forward to having HTML output and the PHP integration. Amazing job!
Regards, Jan Paul
I don't think you would need to reparse the whole page. I think it would be feasible (touching the parser) to reparse just the paragraph.
2010-08-28 17:09, Jan Paul Posma wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
I'm really looking forward to having HTML output and the PHP integration. Amazing job!
I saw the demo of the sentence level editor and it looks really cool, but I don't think that you should expect any miracles regarding the parser performance.
However, as it is much easier to have a multitude of renderers, I would suggest writing a special renderer for the sentence level editor to label each sentence with an identifier. Then you could introduce a "save sentence" operation that saves the page, but only reparses the particular sentence.
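To make that a bit more concrete, here is a minimal C sketch of what such a sentence-labelling renderer could look like. The struct and callback names (sentence_renderer, on_sentence_start, on_text, on_sentence_end) are only placeholders invented for the sketch; the real listener interface may look quite different.

#include <stdio.h>

/* Hypothetical renderer state: a running sentence counter and the stream
 * that the generated HTML is written to. */
typedef struct {
    int   sentence_id;
    FILE *out;
} sentence_renderer;

/* Hypothetical listener callbacks; a real renderer would receive events
 * of roughly this shape from the parser. */
static void on_sentence_start(sentence_renderer *r)
{
    /* Label each sentence so the editor can address it individually. */
    fprintf(r->out, "<span id=\"sentence-%d\">", r->sentence_id);
}

static void on_text(sentence_renderer *r, const char *text, size_t len)
{
    /* HTML escaping is omitted in this sketch. */
    fwrite(text, 1, len, r->out);
}

static void on_sentence_end(sentence_renderer *r)
{
    fputs("</span>", r->out);
    r->sentence_id++;
}

Since a renderer like this would just be another listener, it could in principle run alongside the ordinary html renderer.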
/Andreas
I saw the demo of the sentence level editor and it looks really cool, but I don't think that you should expect any miracles regarding the parser performance.
Yeah, well, let's see how the final version with PHP integration performs. I just hope it'll be better than the current parser :-)
However, as it is much easier to have a multitude of renderers, I would suggest writing a special renderer for the sentence level editor to label each sentence with an identifier. Then you could introduce a "save sentence" operation that saves the page, but only reparses the particular sentence.
Yeah, this was my initial approach, but the problem is that there are dependencies across the page (e.g. references). Perhaps I can include a check to decide whether to update the whole page or only a part of it. For now, though, updating the whole page seems like the most robust approach.
Anyway, what are your plans for PHP integration? It would be really nice to be able to include hooks after the lexer, but before actual parsing.
Regards, Jan Paul
2010-08-30 10:30, Jan Paul Posma wrote:
I saw the demo of the sentence level editor and it looks really cool, but I don't think that you should expect any miracles regarding the parser performance.
Yeah, well, let's see how the final version with PHP integration performs. I just hope it'll be better than the current parser :-)
However, as it is much easier to have a multitude of renderers, I would suggest writing a special renderer for the sentence level editor to label each sentence with an identifier. Then you could introduce a "save sentence" operation that saves the page, but only reparses the particular sentence.
Yeah, this was my initial approach, but the problem is that there are dependencies across the page (e.g. references). Perhaps I can include a check to decide whether to update the whole page or only a part of it. For now, though, updating the whole page seems like the most robust approach.
Anyway, what are your plans for PHP integration?
The antlr runtime supports reading input from a UTF-8, UTF-16, or UTF-32 encoded buffer, so I guess that on the input side the integration should be trivial. The documentation on php.net is very limited, but I guess that one of those encodings is used?
On the output side there are two approaches:
1. Implement a listener that outputs html code to a php-readable buffer.
2. Export the listener api to php.
Option 1 is fast and option 2 is flexible, so it would probably be a good idea to do both.
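A rough sketch of what option 1 could look like; the buffer type and callback names below are made up for illustration, not a settled api:

#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer that collects the generated html; the php
 * extension would read data as one string when parsing is finished.
 * The buffer is assumed to start out zero-initialized, and error
 * handling is omitted in this sketch. */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} html_buffer;

static void html_buffer_append(html_buffer *b, const char *s, size_t n)
{
    if (b->len + n + 1 > b->cap) {
        size_t new_cap = b->cap ? b->cap : 4096;
        while (new_cap < b->len + n + 1)
            new_cap *= 2;
        b->data = realloc(b->data, new_cap);
        b->cap  = new_cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
}

/* Hypothetical listener callbacks translating parser events to html. */
static void on_paragraph_begin(html_buffer *b) { html_buffer_append(b, "<p>", 3); }
static void on_paragraph_end(html_buffer *b)   { html_buffer_append(b, "</p>", 4); }
static void on_text(html_buffer *b, const char *text, size_t len)
{
    html_buffer_append(b, text, len); /* escaping omitted */
}

Option 2 would instead expose callbacks of this kind directly to php code, which is more flexible but adds a call across the language boundary for every event.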
It would be really nice to be able to include hooks after the lexer, but before actual parsing.
That could be done, but I would not recommend it. What application do you have in mind?
/Andreas
It would be really nice to be able to include hooks after the lexer, but before actual parsing.
That could be done, but I would not recommend it. What application do you have in mind?
Well, the current implementation of my editor uses a bunch of regexes (like the current parser) to determine where to inject spans or divs into the wikitext. Having a more accurate representation (the tokenized wikitext that the lexer outputs) would allow for more accurate injection. Then again, it would be complicated to interface that with PHP, I guess?
How would you handle hooks, tag extensions, parser functions and magic words anyway? Will you leave this to some post-processing stage in PHP or have things interact during parsing?
Regards, Jan Paul
2010-08-30 13:22, Jan Paul Posma wrote:
It would be really nice to be able to include hooks after the lexer, but before actual parsing.
That could be done, but I would not recommend it. What application do you have in mind?
Well, the current implementation of my editor uses a bunch of regexes (like the current parser) to determine where to inject spans or divs into the wikitext. Having a more accurate representation (the tokenized wikitext that the lexer outputs) would allow for more accurate injection. Then again, it would be complicated to interface that with PHP, I guess?
Between the lexer and the parser there is just the stream of tokens. How that relates to the ultimately rendered content is non-trivial. I think that you would be much better off working on top of the listener interface. It would help you, I'd guess, to introduce the period character (or, more generally, a localizable sentence separator character) as its own token and pass that as an event. But that cannot be efficiently implemented as a "hook"; it has to be integrated in the lexer. It should be perfectly possible, though, to define sentences in the event stream even without such a token.
How would you handle hooks, tag extensions, parser functions and magic words anyway? Will you leave this to some post-processing stage in PHP or have things interact during parsing?
The listener interface in itself constitutes a collection of hooks. From the parser's point of view, a tag extension works the same as <nowiki>. It's up to the listening application to call the appropriate function to process the content. Magic words and parser functions should be handled by a preprocessor, as the substitution of these may yield new tokens.
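As an illustration of the tag extension part, here is a small C sketch of what the dispatch on the listening application's side could look like. The on_tag_extension event, the handler signature, and the <example> tag are all invented for the sketch:

#include <stdio.h>
#include <string.h>

/* Hypothetical signature for a host-registered tag extension handler;
 * it receives the raw, unparsed body of the tag, just like <nowiki>. */
typedef void (*tag_extension_handler)(const char *body, size_t len);

struct tag_extension {
    const char            *name;
    tag_extension_handler  handler;
};

/* Made-up example handler that just echoes the content preformatted. */
static void render_example_tag(const char *body, size_t len)
{
    printf("<pre>%.*s</pre>", (int)len, body);
}

static struct tag_extension extensions[] = {
    { "example", render_example_tag },
};

/* Hypothetical listener event: the parser has seen <example>...</example>
 * and hands the untouched content to the listening application, which
 * looks up and calls the appropriate handler. */
static void on_tag_extension(const char *name, const char *body, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof extensions / sizeof extensions[0]; i++) {
        if (strcmp(extensions[i].name, name) == 0) {
            extensions[i].handler(body, len);
            return;
        }
    }
    /* Unknown tag: fall back to emitting the text literally. */
    printf("&lt;%s&gt;%.*s", name, (int)len, body);
}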
/Andreas