I have imported the parser implementation to the repository:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
Dependencies:
* antlr snapshot. Be sure to apply the patch to the C runtime.
* libtre. Regexp library for wide character strings. (Not actually used yet.)
There is no php integration yet.
Below is a list of cases I'm aware of where the behavior differs from Parser.php. (libmwparser doesn't actually output html at the moment, but in the examples below I've converted the traces to html in the obvious way for comparison.)
- Definition lists:
;; item
Parser.php: <dl><dt></dt><dl><dt> item </dt></dl></dl> libmwparser: <dl><dl><dt> item</dt></dl></dl>
- Html/table attributes:
{| id='a class='b' | col1 |}
Parser.php: <table class='b'><tbody><tr><td> col1 </td></tr></tbody></table> libmwparser: <table><tbody><tr><td> col1 </td></tr></tbody></table>
(libmwparser does not backtrack to the space character to try to find a valid attribute; it just considers id='a class='<junk characters> to be garbage altogether.)
- libmwparser restricts some block element tokens to the correct block contexts.
- inline formatting:
<b>'''bold'''</b>
Parser.php: <b><b>bold</b></b> libmwparser: <b>bold</b>
- long-term formatting is applied to all inline text:
<i>text {| | col1 |} text</i>
Parser.php: <p><i>text</i></p><table><tbody><tr><td> col1 </td></tr></tbody></table><p><i>text</i></p> libmwparser: <p><i>text</i></p><table><tbody><tr><td><i> col1</i></td></tr></tbody></table><p><i>text</i></p>
- internal links are treated as long-term formatting:
[[Link|text {| | col1 |} text]]
Parser.php: <p><a href="...">text</p><table><tbody><tr><td> col1 </td></tr></tbody></table><p>text</a></p> libmwparser: <p><a href="...">text</a></p><table><tbody><tr><td><a href="..."> col1</a></td></tr></tbody></table><p><a href="...">text</a></p>
- In general, any case that causes Parser.php to generate invalid html is likely to differ in libmwparser.
Some benchmarking:
The performance isn't very impressive.
I've tried very quickly to make a comparison:
Parser.php:
* Mediawiki 1.15.0 running on a 2.2 GHz AMD Opteron 275
* I'm measuring from just before internalParse to right after doBlockLevels.
libmwparser:
* 2.5 GHz Core 2 Duo
* The time for outputting the traces to /dev/null is included
128kB of plain text:
Parser.php: 170ms libmwparser: 180ms
The page http://en.wikipedia.org/wiki/Wikipedia (templates not installed at the mediawiki test server), size 124 kB:
Parser.php: 720ms libmwparser: 190ms
As expected, Parser.php takes more time the more markup there is on the page, while libmwparser maintains a fairly constant pace.
/Andreas
On 27 August 2010 19:00, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
The performance isn't very impressive. I've tried very quickly to make a comparison:
It's comparable, and I'm amazed!
:-O
So ... is it fit for people to drop into place and experiment with?
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
- d.
On 27 August 2010 22:11, David Gerard dgerard@gmail.com wrote:
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
http://davidgerard.co.uk/notes/2010/08/22/staring-into-the-eye-of-cthulhu/
- d.
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
Regards, Mingli
On Sat, Aug 28, 2010 at 5:16 AM, David Gerard dgerard@gmail.com wrote:
On 27 August 2010 22:11, David Gerard dgerard@gmail.com wrote:
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
http://davidgerard.co.uk/notes/2010/08/22/staring-into-the-eye-of-cthulhu/
- d.
Mingli Yuan wrote:
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
Regards, Mingli
svn doesn't have git's inconvenience of forcing you to check out all or nothing. To get only libmwparser just run: svn checkout http://svn.wikimedia.org/viewvc/mediawiki/trunk/parsers/libmwparser/
Thanks, Platonides.
I know the svn commands to check out or export a directory. But I want to emphasize that the collaboration models of svn and git are different.
For example, if I want to add some test cases, I need someone to grant me svn access. With github, I can push my contribution back upstream very easily, and the original author can choose what to pull back.
The above is just my advice; if Andreas chooses svn, I will follow his changes on svn as well. Let's drop the irrelevant git/svn discussion and get back to the main topic.
Regards, Mingli
On Sat, Aug 28, 2010 at 7:27 PM, Platonides platonides@gmail.com wrote:
Mingli Yuan wrote:
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
Regards, Mingli
svn doesn't have git's inconvenience of forcing you to check out all or nothing. To get only libmwparser just run: svn checkout http://svn.wikimedia.org/viewvc/mediawiki/trunk/parsers/libmwparser/
2010-08-28 05:01, Mingli Yuan wrote:
Great news!
Just a piece of advice: could you commit the code to github? Then we could fork it, play with it, and contribute back easily. Svn is not as convenient as git, and the mediawiki svn repository is too large. Free software should be really free, not bound to a centralized repository. :-)
I too like git, and I often use it locally even if the central repository is using a different system; checking out a source tree from subversion right into a git working directory works pretty well for me.
I don't really have a strong preference on which system to use. It just seemed appropriate to put it on mediawiki.org.
Github seems very nice, though, so maybe I'll reconsider. But free software also means not depending on any single person, so if you or someone else would like to take the initiative and set up a repository with a project page and issue tracker, I would be happy to support that.
Best regards,
Andreas
2010-08-27 23:11, David Gerard wrote:
On 27 August 2010 19:00, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
The performance isn't very impressive. I've tried very quickly to make a comparison:
It's comparable, and I'm amazed!
:-O
I would have thought that a "real" parser would clearly be ahead on any input, but maybe I'm underestimating the original parser. Anyway, there should be room for optimization. Profiling shows that about 64% of the time is spent in the parser, while 25% of the time is spent in the lexer. To me it seems that the parser has an easier task than the lexer, and should therefore be faster.
So ... is it fit for people to drop into place and experiment with?
No, not yet. Some features are missing and there is no php-integration yet. But unless I've missed something, there is nothing that I don't immediately know how to implement.
/Andreas
This should really be reported more widely :-) I posted on my blog about it (and have now linked this message and the svn). Worth announcing on mediawiki-l? Presumably when it's drop-in.
- d.
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
I'm really looking forward to having HTML output and the PHP integration. Amazing job!
Regards, Jan Paul
On 28 August 2010 16:09, Jan Paul Posma jp.posma@gmail.com wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
*May* be. You won't know until you test ;-)
- d.
On 28-Aug-2010, at 17:17, David Gerard wrote:
On 28 August 2010 16:09, Jan Paul Posma jp.posma@gmail.com wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
*May* be. You won't know until you test ;-)
Sure, sure, but Andreas' benchmarks are hopeful :-D
Regards, Jan Paul
Jan Paul Posma wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
I'm really looking forward to having HTML output and the PHP integration. Amazing job!
Regards, Jan Paul
I don't think you would need to reparse the whole page. I think it would be feasible (touching the parser) to reparse just the paragraph.
2010-08-28 17:09, Jan Paul Posma wrote:
This is totally awesome. The biggest problem I'm facing with the sentence-level editor right now is that the whole page has to be reparsed in order to make that kind of editing work. With the current parser this takes a lot of time (>1 sec is not uncommon), but using your parser the speed will be good.
I'm really looking forward to having HTML output and the PHP integration. Amazing job!
I saw the demo of the sentence level editor and it looks really cool, but I don't think that you should expect any miracles regarding the parser performance.
However, as it is much easier to have a multitude of renderers, I would suggest writing a special renderer for the sentence level editor to label each sentence with an identifier. Then you could introduce a "save sentence" operation that saves the page, but only reparses the particular sentence.
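To make that a bit more concrete, here is a minimal C sketch of what such a sentence-labelling renderer could look like. The struct and callback names (sentence_renderer, on_sentence_start, on_text, on_sentence_end) are only placeholders invented for the sketch; the real listener interface may look quite different.

#include <stdio.h>

/* Hypothetical renderer state: a running sentence counter and the stream
 * that the generated HTML is written to. */
typedef struct {
    int   sentence_id;
    FILE *out;
} sentence_renderer;

/* Hypothetical listener callbacks; a real renderer would receive events
 * of roughly this shape from the parser. */
static void on_sentence_start(sentence_renderer *r)
{
    /* Label each sentence so the editor can address it individually. */
    fprintf(r->out, "<span id=\"sentence-%d\">", r->sentence_id);
}

static void on_text(sentence_renderer *r, const char *text, size_t len)
{
    /* HTML escaping is omitted in this sketch. */
    fwrite(text, 1, len, r->out);
}

static void on_sentence_end(sentence_renderer *r)
{
    fputs("</span>", r->out);
    r->sentence_id++;
}

Since a renderer like this would just be another listener, it could in principle run alongside the ordinary html renderer.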
/Andreas
I saw the demo of the sentence level editor and it looks really cool, but I don't think that you should expect any miracles regarding the parser performance.
Yeah, well, let's see how the final version with PHP integration performs. I just hope it'll be better than the current parser :-)
However, as it is much easier to have a multitude of renderers, I would suggest writing a special renderer for the sentence level editor to label each sentence with an identifier. Then you could introduce a "save sentence" operation that saves the page, but only reparses the particular sentence.
Yeah, this was my initial approach, but the problem is that there are dependencies across the page (e.g. references). Perhaps I can include a check to decide whether to update the whole page or only a part of it. For now, though, updating the whole page seems like the most robust approach.
Anyway, what are your plans for PHP integration? It would be really nice to be able to include hooks after the lexer, but before actual parsing.
Regards, Jan Paul
2010-08-30 10:30, Jan Paul Posma wrote:
I saw the demo of the sentence level editor and it looks really cool, but I don't think that you should expect any miracles regarding the parser performance.
Yeah, well, let's see how the final version with PHP integration performs. I just hope it'll be better than the current parser :-)
However, as it is much easier to have a multitude of renderers, I would suggest writing a special renderer for the sentence level editor to label each sentence with an identifier. Then you could introduce a "save sentence" operation that saves the page, but only reparses the particular sentence.
Yeah, this was my initial approach, but the problem is that there are dependencies across the page (e.g. references). Perhaps I can include a check to decide whether to update the whole page or only a part of it. For now, though, updating the whole page seems like the most robust approach.
Anyway, what are your plans for PHP integration?
The antlr runtime supports reading input from a UTF-8, UTF-16, or UTF-32 encoded buffer, so I guess that on the input side the integration should be trivial. The documentation on php.net is very limited, but I guess that one of those encodings is used?
On the output side there are two approaches:
1. Implement a listener that outputs html code to a php-readable buffer.
2. Export the listener api to php.
Option 1 is fast and option 2 is flexible, so it would probably be a good idea to do both.
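A rough sketch of what option 1 could look like; the buffer type and callback names below are made up for illustration, not a settled api:

#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer that collects the generated html; the php
 * extension would read data as one string when parsing is finished.
 * The buffer is assumed to start out zero-initialized, and error
 * handling is omitted in this sketch. */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} html_buffer;

static void html_buffer_append(html_buffer *b, const char *s, size_t n)
{
    if (b->len + n + 1 > b->cap) {
        size_t new_cap = b->cap ? b->cap : 4096;
        while (new_cap < b->len + n + 1)
            new_cap *= 2;
        b->data = realloc(b->data, new_cap);
        b->cap  = new_cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
}

/* Hypothetical listener callbacks translating parser events to html. */
static void on_paragraph_begin(html_buffer *b) { html_buffer_append(b, "<p>", 3); }
static void on_paragraph_end(html_buffer *b)   { html_buffer_append(b, "</p>", 4); }
static void on_text(html_buffer *b, const char *text, size_t len)
{
    html_buffer_append(b, text, len); /* escaping omitted */
}

Option 2 would instead expose callbacks of this kind directly to php code, which is more flexible but adds a call across the language boundary for every event.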
It would be really nice to be able to include hooks after the lexer, but before actual parsing.
That could be done, but I would not recommend it. What application do you have in mind?
/Andreas
It would be really nice to be able to include hooks after the lexer, but before actual parsing.
That could be done, but I would not recommend it. What application do you have in mind?
Well, the current implementation of my editor uses a bunch of regexes (like the current parser) to determine where to inject spans or divs into the wikitext. Having a more accurate representation (the tokenized wikitext that the lexer outputs) would allow for more accurate injection. Then again, it would be complicated to interface that with PHP, I guess?
How would you handle hooks, tag extensions, parser functions and magic words anyway? Will you leave this to some post-processing stage in PHP or have things interact during parsing?
Regards, Jan Paul
2010-08-30 13:22, Jan Paul Posma wrote:
It would be really nice to be able to include hooks after the lexer, but before actual parsing.
That could be done, but I would not recommend it. What application do you have in mind?
Well, the current implementation of my editor uses a bunch of regexes (like the current parser) to determine where to inject spans or divs into the wikitext. Having a more accurate representation (the tokenized wikitext that the lexer outputs) would allow for more accurate injection. Then again, it would be complicated to interface that with PHP, I guess?
Between the lexer and the parser there is just the stream of tokens. How that relates to the ultimately rendered content is non-trivial. I think that you would be much better off working on top of the listener interface. It would help you, I'd guess, to introduce the period character (or, more generally, a localizable sentence separator character) as its own token and pass that as an event. But that cannot be efficiently implemented as a "hook"; it has to be integrated in the lexer. It should be perfectly possible, though, to define sentences in the event stream even without such a token.
How would you handle hooks, tag extensions, parser functions and magic words anyway? Will you leave this to some post-processing stage in PHP or have things interact during parsing?
The listener interface in itself constitutes a collection of hooks. From the parser's point of view, a tag extension works the same as <nowiki>. It's up to the listening application to call the appropriate function to process the content. Magic words and parser functions should be handled by a preprocessor, as the substitution of these may yield new tokens.
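As an illustration of the tag extension part, here is a small C sketch of what the dispatch on the listening application's side could look like. The on_tag_extension event, the handler signature, and the <example> tag are all invented for the sketch:

#include <stdio.h>
#include <string.h>

/* Hypothetical signature for a host-registered tag extension handler;
 * it receives the raw, unparsed body of the tag, just like <nowiki>. */
typedef void (*tag_extension_handler)(const char *body, size_t len);

struct tag_extension {
    const char            *name;
    tag_extension_handler  handler;
};

/* Made-up example handler that just echoes the content preformatted. */
static void render_example_tag(const char *body, size_t len)
{
    printf("<pre>%.*s</pre>", (int)len, body);
}

static struct tag_extension extensions[] = {
    { "example", render_example_tag },
};

/* Hypothetical listener event: the parser has seen <example>...</example>
 * and hands the untouched content to the listening application, which
 * looks up and calls the appropriate handler. */
static void on_tag_extension(const char *name, const char *body, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof extensions / sizeof extensions[0]; i++) {
        if (strcmp(extensions[i].name, name) == 0) {
            extensions[i].handler(body, len);
            return;
        }
    }
    /* Unknown tag: fall back to emitting the text literally. */
    printf("&lt;%s&gt;%.*s", name, (int)len, body);
}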
/Andreas