Brion said to me a couple of weeks ago "the parser is slow for large articles, fix it". So along these lines, I have rewritten the preprocessor phase to make it faster in PHP. I also have plans for further speed improvement via a partial port to C.
This work was planned and started before the recent parser discussions on wikitech-l, by Steve Bennett et al. I chose to ignore those discussions to improve my productivity. Apologies if I'm stepping on any toes.
I'll cover the technical side of this first, and then the impact for the user in terms of wikitext syntax change.
This text is mostly adapted from my entry in RELEASE-NOTES.
== Technical viewpoint ==
The parser pass order has changed from
* Extension tag strip and render
* HTML normalisation and security
* Template expansion
* Main section...
to
* Template and extension tag parse to intermediate representation
* Template expansion and extension rendering
* HTML normalisation and security
* Main section...
The new two-pass preprocessor can skip "dead branches" in template expansion, such as unfollowed #if cases and unused defaults for template arguments. This provides a significant performance improvement in template-heavy test cases taken from Wikipedia. Parser function hooks can participate in this performance improvement by using the new SFH_OBJECT_ARGS flag during registration.
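As a rough illustration (a made-up {{#choose:}} hook; the magic word registration an extension would also need is omitted), a hook registered with SFH_OBJECT_ARGS receives preprocessor nodes rather than expanded text, so only the branch it actually returns gets expanded:

$wgParser->setFunctionHook( 'choose', 'wfChooseObjArgs', SFH_OBJECT_ARGS );

function wfChooseObjArgs( $parser, $frame, $args ) {
	// With SFH_OBJECT_ARGS the arguments arrive as preprocessor nodes,
	// not as expanded text. expand() also accepts plain strings.
	$test = isset( $args[0] ) ? trim( $frame->expand( $args[0] ) ) : '';
	if ( $test !== '' ) {
		// Only the branch that is returned is expanded; the other
		// stays a dead branch and costs nothing.
		return isset( $args[1] ) ? trim( $frame->expand( $args[1] ) ) : '';
	}
	return isset( $args[2] ) ? trim( $frame->expand( $args[2] ) ) : '';
}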
The intermediate representation I have used is a DOM document tree, taking advantage of PHP's standard access to libxml's efficient tree structures. I construct the tree via an XML text stage, although it could be done directly with DOM. My gut feeling was that the XML implementation would be faster, but I've made the interfaces such that it could be done either way. The XML form is not exposed.
One reason for using an intermediate representation is so that the parse results for templates can be cached. The theory is that the cached results can then be used to efficiently expand templates with changeable arguments, such as {{cite web}}. (There's also an expansion cache for templates expanded with no arguments, such as {{•}}.)
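To make the caching idea concrete, here is a minimal sketch (hypothetical variable names, and preprocessor accessors from a later revision, so treat the exact calls as an assumption rather than the actual code in Parser.php): the preprocessed DOM depends only on the template's source text, so it can be built once and re-expanded with each call's arguments.

// Hypothetical sketch of the template DOM cache idea; not the real Parser code.
$key = md5( $templateSource );
if ( !isset( $templateDomCache[$key] ) ) {
	// Parse the template source into its intermediate DOM form just once.
	$templateDomCache[$key] = $wgParser->preprocessToDom( $templateSource );
}
// Each transclusion expands the cached DOM against its own argument frame,
// so the expensive parse step is not repeated per call.
$frame = $wgParser->getPreprocessor()->newCustomFrame( $templateArgs );
$expanded = $frame->expand( $templateDomCache[$key] );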
Another reason is that I couldn't see any efficient (O(N) worst-case time order) way to implement dead branch elimination without an intermediate representation.
The pre-expand include size limit has been removed, since there's no efficient way to calculate such a figure, and it would now be meaningless for performance anyway. The "preprocessor node count" takes its place, with a generous default limit.
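The limit can be tuned per wiki. Assuming it is exposed under the name used in current MediaWiki (an assumption worth checking against DefaultSettings.php for the version at hand), something like this in LocalSettings.php would adjust it:

// LocalSettings.php: cap on preprocessor nodes generated while expanding a page.
// The name $wgMaxPPNodeCount is taken from current MediaWiki and may differ here.
$wgMaxPPNodeCount = 1000000; // generous; lower it to fail earlier on pathological pages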
The context in which XML-style extension tags are called has changed, so extensions which make use of the parser state may need compatibility changes. Since extension tags are now rendered simultaneously with template expansion, there is a possibility for future improvement of the extension tag interface. For example, we could have preprocessor-transparent tags which act like parser functions, and we could give extension tags access to the template arguments (i.e. triple brace expansion).
== User viewpoint ==
The main effect of this for the user is that the rules for uncovered syntax have changed.
Uncovered main-pass syntax, such as HTML tags, is now generally valid, whereas previously it was in some cases escaped. For example, you could have "<ta" in one template, and "ble>" in another template, and put them together to make a valid <table> tag. Previously the result would have been the escaped text "&lt;table>".
Uncovered preprocessor syntax is generally not recognised. For example, if you have "{{a" in Template:A and "b}}" in Template:B, then "{{a}}{{b}}" will be converted to a literal "{{ab}}" rather than the contents of Template:Ab. This was the case previously in HTML output mode, and is now uniformly the case in the other modes as well. HTML-style comments uncovered by template expansion will not be recognised by the preprocessor and hence will not prevent template expansion within them, but they will be stripped by the following HTML security pass.
The rules for template expansion during message transformation were counterintuitive, mostly accidental and buggy. There are a few small changes in this version: for example, templates with dynamic names, as in "{{ {{a}} }}", are fully expanded as they are in HTML mode, whereas previously only the inner template was expanded. I'd like to make some larger breaking changes to message transformation, after a review of typical use cases.
The header identification routines for section edit and for numbering section edit links have been merged. This removes a significant failure mode and fixes a whole category of bugs (tracked by bug #4899). Wikitext headings uncovered by template expansion or comment removal will still be rendered into a heading tag, and will get an entry in the TOC, but will not have a section edit link. HTML-style headings will also not have a section edit link. Valid wikitext headings present in the template source text will get a template section edit link. This is a major break from previous behaviour, but I believe the effects are almost entirely beneficial.
-- Tim Starling
On 21/11/2007, Tim Starling tstarling@wikimedia.org wrote:
This work was planned and started before the recent parser discussions on wikitech-l, by Steve Bennett et al. I chose to ignore those discussions to improve my productivity. Apologies if I'm stepping on any toes.
That's fine - it's a reimplementation, so a moving target doesn't matter as long as we know when the target's moved ;-) I forwarded a copy to wikitext-l.
- d.
On 21/11/2007, Tim Starling tstarling@wikimedia.org wrote:
previously only the inner template was expanded. I'd like to make some larger breaking changes to message transformation, after a review of typical use cases.
Oh, and please do post *planned* wikitext syntax changes such as these to wikitext-l.
- d.
On 11/21/07, Tim Starling tstarling@wikimedia.org wrote:
This work was planned and started before the recent parser discussions on wikitech-l, by Steve Bennett et al. I chose to ignore those discussions to improve my productivity. Apologies if I'm stepping on any toes.
Not at all. Actual productive work should always take precedence over vapourware :)
I've actually been mostly ignoring the preprocessor until now - it could certainly be redone in the same style as the main parser, but it would (I think) always have to be a separate step, to handle template transclusion.
Realistically, even if I/we get the grammar formalised and hack up a quick ANTLR-based parser, a usable new parser is at least 6 months away, even assuming all goes well.
Also the fact that performance has become an issue probably leans even more towards the hand-built solution rather than a generated parser.
Steve
On Wed, Nov 21, 2007 at 02:00:34PM +1100, Steve Bennett wrote:
Also the fact that performance has become an issue probably leans even more towards the hand-built solution rather than a generated parser.
It does for WMF, maybe.
What *I* want to take away is a parser that does the first 80% of WT syntax, that I can drop into my CMS. And I'm certainly not alone.
Cheers, -- jra
Jay R. Ashworth wrote:
On Wed, Nov 21, 2007 at 02:00:34PM +1100, Steve Bennett wrote:
Also the fact that performance has become an issue probably leans even more towards the hand-built solution rather than a generated parser.
It does for WMF, maybe.
What *I* want to take away is a parser that does the first 80% of WT syntax, that I can drop into my CMS. And I'm certainly not alone.
If you want this you, or someone with a similar goal, will have to code it yourself. It is unlikely anybody would do it for you, especially if they just contribute to MediaWiki. However, I have been told, it is fairly simple to just override some of the functions to remove any database stuff - that would make it work in any CMS.
MinuteElectron.
On Wed, Nov 21, 2007 at 05:38:18PM +0000, MinuteElectron wrote:
Jay R. Ashworth wrote:
On Wed, Nov 21, 2007 at 02:00:34PM +1100, Steve Bennett wrote:
Also the fact that performance has become an issue probably leans even more towards the hand-built solution rather than a generated parser.
It does for WMF, maybe.
What *I* want to take away is a parser that does the first 80% of WT syntax, that I can drop into my CMS. And I'm certainly not alone.
If you want this you, or someone with a similar goal, will have to code it yourself. It is unlikely anybody would do it for you, especially if they just contribute to MediaWiki. However, I have been told, it is fairly simple to just override some of the functions to remove any database stuff - that would make it work in any CMS.
Well, I'm not at all sure that my desire is unreasonable and your response, reasonable.
Very early in Steve's work, the point was raised that one of the targets driving the effort to define the language well enough for the parser to be reimplemented was that it couldn't be a bad thing if people needing lightweight markup languages for other purposes could easily utilise mwtext.
Cheers, -- jra
On Wed, 21 Nov 2007 12:46:32 -0500, Jay R. Ashworth wrote:
On Wed, Nov 21, 2007 at 05:38:18PM +0000, MinuteElectron wrote:
Jay R. Ashworth wrote:
On Wed, Nov 21, 2007 at 02:00:34PM +1100, Steve Bennett wrote:
Also the fact that performance has become an issue probably leans even more towards the hand-built solution rather than a generated parser.
It does for WMF, maybe.
What *I* want to take away is a parser that does the first 80% of WT syntax, that I can drop into my CMS. And I'm certainly not alone.
If you want this you, or someone with a similar goal, will have to code it yourself. It is unlikely anybody would do it for you, especially if they just contribute to MediaWiki. However, I have been told, it is fairly simple to just override some of the functions to remove any database stuff - that would make it work in any CMS.
Well, I'm not at all sure that my desire is unreasonable and your response, reasonable.
Very early in Steve's work, the issue was raised that one of the targets driving the effort of defining the language so that the parser could be reimplemented was the fact that it couldn't be a bad thing if people needing lightweight markup languages for other purposes could easily utilise mwtext for that.
Cheers, -- jra
My perspective is mostly, what is the data set tied to? I.e. a specification, a reusable component, or a set of applications.
In that sense, it's probably not so much that a parser to drop into other apps is useful for its own sake, as that it would be complementary to less encumbered data.
On 11/21/07, Tim Starling tstarling@wikimedia.org wrote:
<snip>
The parser pass order has changed from
<snip>
* Template and extension tag parse to intermediate representation
* Template expansion and extension rendering
* HTML normalisation and security
* Main section...
<snip>
The intermediate representation I have used is a DOM document tree, taking
<snip>
Uncovered main-pass syntax, such as HTML tags, is now generally valid, whereas previously it was in some cases escaped. For example, you could have "<ta" in one template, and "ble>" in another template, and put them together to make a valid <table> tag. Previously the result would have been the escaped text "&lt;table>".
I'm not sure I grok the impact of these changes. Say you have template A defined as follows:
''italics'' '''open-bold <b
which is called from an article with text
Crazy: {{A}}r /> stuff'''.
What text does the parser see exactly? Will it see a mixture of both rendered and un-rendered HTML? ie:
Crazy: <i>italics</i> '''open-bold <br /> stuff'''.
I'm guessing not, because obviously not all HTML is valid input to the parser (unlike the <i> in this case). Would you mind explaining a bit more?
I was thinking this over and wondering whether there would be benefit in an explicit tag in templates to mark code that should be passed unrendered through to its calling page. Something like <includeraw><ta</includeraw>. But since I haven't really understood how the new PP works, this probably isn't necessary?
Thanks, Steve
On 11/21/07, Steve Bennett stevagewp@gmail.com wrote:
I'm not sure I grok the impact of these changes. Say you have template A defined as follows:
''italics'' '''open-bold <b
which is called from an article with text
Crazy: {{A}}r /> stuff'''.
What text does the parser see exactly? Will it see a mixture of both rendered and un-rendered HTML? ie:
Crazy: <i>italics</i> '''open-bold <br /> stuff'''.
I'm guessing not, because obviously not all HTML is valid input to the parser (unlike the <i> in this case). Would you mind explaining a bit more?
Well, as he says, uncovered main-pass syntax will now generally be valid. Bold and HTML are both handled in the main (post-template) pass, so you'd get
Crazy: <i>italics</i> <b>open-bold <br /> stuff</b>.
Even right now you can open wiki-italics or bold in one template, and close them in another. Or tables, etc. You can use chunks of markup only, if you like: Template:A = '' and template B = ' means {{a}}x{{a}} {{a}}{{b}}y{{b}}{{a}} = <i>x</i> <b>y</b>. See for yourself:
http://en.wikipedia.org/wiki/User:Simetrical/Apostrophes_and_templates
But this doesn't currently work for broken-up HTML tags.
On 11/22/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Well, as he says, uncovered main-pass syntax will now generally be valid. Bold and HTML are both handled in the main (post-template) pass, so you'd get
Crazy: <i>italics</i> <b>open-bold <br /> stuff</b>.
That's ok, I think the meaning *after* the parse phase is clear - I'm just wondering what text the parser is going to have to operate on. In the current in-place transformation parser, it's not really an issue mixing pre- and post-rendered text, because that's what it's doing constantly anyway. But you really don't want that in a recursive descent parser.
Even right now you can open wiki-italics or bold in one template, and close them in another. Or tables, etc. You can use chunks of markup only, if you like: Template:A = '' and template B = ' means {{a}}x{{a}} {{a}}{{b}}y{{b}}{{a}} = <i>x</i> <b>y</b>. See for yourself:
That's all fine, and that fits with the notion of "preprocessor": find all the template references, insert raw text in their place, and *then* parse. The parser never even knows it's happened.
However, pre-rendering some of the stuff to be transcluded is possibly more complicated. That's what I'm trying to find out.
But this doesn't currently work for broken-up HTML tags.
By accident, rather than design. There's no good reason it shouldn't be allowed (and obviously it's now fixed). Though when you're looking at the contents of the template directly, some special processing could be required.
Steve
On Wed, Nov 21, 2007 at 06:18:15PM -0500, Steve Sanbeg wrote:
Well, I'm not at all sure that my desire is unreasonable and your response, reasonable.
Very early in Steve's work, the issue was raised that one of the targets driving the effort of defining the language so that the parser could be reimplemented was the fact that it couldn't be a bad thing if people needing lightweight markup languages for other purposes could easily utilise mwtext for that.
My perspective is mostly, what is the data set tied to? I.e. a specification, a reusable component, or a set of applications.
Which data set? The language specification David seeks?
In that sense, it's probably not so much useful to create a parser to drop into other apps for its own sake, as that that would be complementary to less encumbered data.
More my point was that it seems more useful in the grand scale to make sure one's thinking about a language spec that's trimmable for smaller uses, than implementing an actual parser that can be dropped into other things... those other things are almost certainly not in PHP anyway. (If their implementers have any sense :-)
Cheers, -- jra
On 22/11/2007, Jay R. Ashworth jra@baylink.com wrote:
More my point was that it seems more useful in the grand scale to make sure one's thinking about a language spec that's trimmable for smaller uses, than implementing an actual parser that can be dropped into other things... those other things are almost certainly not in PHP anyway. (If their implementers have any sense :-)
A spec will be recompilable to anything useful, including a fast C preprocessor for those with access to gcc (i.e., probably not in the default MediaWiki download).
- d.
Steve Bennett wrote:
On 11/21/07, Tim Starling tstarling@wikimedia.org wrote:
<snip>
> The parser pass order has changed from
<snip>
> * Template and extension tag parse to intermediate representation
> * Template expansion and extension rendering
> * HTML normalisation and security
> * Main section...
<snip>
> The intermediate representation I have used is a DOM document tree, taking
<snip>
Uncovered main-pass syntax, such as HTML tags, is now generally valid, whereas previously it was in some cases escaped. For example, you could have "<ta" in one template, and "ble>" in another template, and put them together to make a valid <table> tag. Previously the result would have been the escaped text "&lt;table>".
I'm not sure I grok the impact of these changes. Say you have template A defined as follows:
''italics'' '''open-bold <b
which is called from an article with text
Crazy: {{A}}r /> stuff'''.
What text does the parser see exactly? Will it see a mixture of both rendered and un-rendered HTML? ie:
Crazy: <i>italics</i> '''open-bold <br /> stuff'''.
I'm guessing not, because obviously not all HTML is valid input to the parser (unlike the <i> in this case). Would you mind explaining a bit more?
Apostrophes are converted to HTML in doAllQuotes(). Invalid HTML on input is cleaned up in removeHTMLtags(). Both are now considered to be *after* the preprocessor. So in your example, the preprocessor will produce:
Crazy: ''italics'' '''open-bold <br /> stuff'''.
I was thinking this over and wondering whether there would be benefit in an explicit tag in templates to mark code that should be passed unrendered through to its calling page. Something like <includeraw><ta</includeraw>. But since I haven't really understood how the new PP works, this probably isn't necessary?
The only things that really need escaping from the preprocessor are the characters "{|=}", and "<" when it occurs before the name of a registered tag hook. For "|" there is the old hack {{!}}, a template which contains just "|". This takes advantage of the uncovered syntax rules in the preprocessor to hide a character from the preprocessor, passing a literal "|" through to the main pass. It's used for table syntax. This mechanism could be extended and standardised, say with a "urldecode" parser function, to put any arbitrary character into the preprocessor output.
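To make that concrete (Template:Row here is made up purely for illustration; only Template:! is the established convention): inside a template call such as {{Row|cells={{!}} a {{!}}{{!}} b }}, the pipes written as {{!}} are nested brace constructs at the time the arguments are split, so they don't act as argument separators. They expand to "|" afterwards, and because uncovered preprocessor syntax is not re-tokenised, the main pass simply sees "| a || b" and can parse it as table syntax.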
Tags such as <gallery> work by an uglier and more fragile method, i.e. with strip markers. Strip markers are placeholders passed through to the main pass, where they are hopefully not mangled too badly. They are then substituted with their rendered value, potentially destroying whatever HTML the intervening passes put in their vicinity, then finally doBlockLevels() is run, which mangles their HTML output unless the tag hook writer carefully armoured it.
-- Tim Starling
On 11/23/07, Tim Starling tstarling@wikimedia.org wrote:
Apostrophes are converted to HTML in doAllQuotes(). Invalid HTML on input is cleaned up in removeHTMLtags(). Both are now considered to be *after* the preprocessor. So in your example, the preprocessor will produce:
Crazy: ''italics'' '''open-bold <br /> stuff'''.
Ok, that's good then.
The only thing that really needs escaping from the preprocessor are the characters "{|=}", and "<" when it occurs before the name of a registered tag hook. For "|" there is the old hack {{!}}, a template which contains just "|". This takes advantage of the uncovered syntax rules in the preprocessor to hide a character from the preprocessor, passing through a literal "|" to the main pass. It's used for table syntax. This mechanism could be extended and standardised, say with a "urldecode" parser function, to put any arbitrary character into the preprocessor output.
So, all those characters, if escaped, will appear to the main parser unescaped, i.e. as if they had been typed in directly. That's also good.
Tags such as <gallery> work by an uglier and more fragile method, i.e. with strip markers. Strip markers are placeholders passed through to the
Ok, does the same go for <ref>? I haven't yet seen a <gallery> be transcluded, but it could happen for <ref>. Presumably the parser will have to be able to recognise and ignore the strip markers. These are the "UNIQ" codes that get dotted in, yeah?
Steve
On Wed, 21 Nov 2007 22:50:04 -0500, Jay R. Ashworth wrote:
On Wed, Nov 21, 2007 at 06:18:15PM -0500, Steve Sanbeg wrote:
Well, I'm not at all sure that my desire is unreasonable and your response, reasonable.
Very early in Steve's work, the issue was raised that one of the targets driving the effort of defining the language so that the parser could be reimplemented was the fact that it couldn't be a bad thing if people needing lightweight markup languages for other purposes could easily utilise mwtext for that.
My perspective is mostly, what is the data set tied to? I.e. a specification, a reusable component, or a set of applications.
Which data set? The language specification David seeks?
Any of the XML dumps. It would be nice if they were tied to something simple, like that language spec, rather than to a specific MediaWiki configuration.
In that sense, it's probably not so much useful to create a parser to drop into other apps for its own sake, as that that would be complementary to less encumbered data.
More my point was that it seems more useful in the grand scale to make sure one's thinking about a language spec that's trimmable for smaller uses, than implementing an actual parser that can be dropped into other things... those other things are almost certainly not in PHP anyway. (If their implementers have any sense :-)
Yes, that would be a good thing.
Cheers, -- jra
Incidentally, is there a way to get the output of the preprocessor, for testing? Something like mediawiki.org/w/preprocess.php?Foo maybe?
Steve
Steve Bennett wrote:
Incidentally, is there a way to get the output of the preprocessor, for testing? Something like mediawiki.org/w/preprocess.php?Foo maybe?
Short answer is no.
There's Special:ExpandTemplates if you want something nice-looking, but it's a bit different to the preprocessor used for HTML mode. For testing during development I used a combination of eval.php and a short web-based script which I could change at will. Here it is in one of its forms:
<?php
require( dirname(__FILE__) . '/includes/WebStart.php' );

$t = Title::newFromText( 'x' );
$o = new ParserOptions;
$text = <<<EOT
=={{xyz}}==
EOT;
header( 'Content-Type: text/xml' );
$dom = $wgParser->preprocessToDom( $text );
print $dom->saveXML();
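For the input above, the printed XML looks roughly like this (node and attribute names recalled from the Preprocessor_DOM implementation rather than a spec, so treat the exact form as approximate; as noted earlier, the XML form is internal and not exposed):

<root><h level="2" i="1">==<template><title>xyz</title></template>==</h></root>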
At various times, I used preprocessToDom(), parse(), replaceVariables() and preSaveTransform() as parser entry points. srvus() is a purpose-made entry point for differential fuzz testing, present in both Parser and Parser_OldPP.
-- Tim Starling