---------- Forwarded message ----------
From: Tim Starling tstarling@wikimedia.org
Date: 21 Nov 2007 02:34
Subject: [Wikitech-l] New preprocessor
To: wikitech-l@lists.wikimedia.org
Brion said to me a couple of weeks ago "the parser is slow for large articles, fix it". So along these lines, I have rewritten the preprocessor phase to make it faster in PHP. I also have plans for further speed improvement via a partial port to C.
This work was planned and started before the recent parser discussions on wikitech-l, by Steve Bennett et al. I chose to ignore those discussions to improve my productivity. Apologies if I'm stepping on any toes.
I'll cover the technical side of this first, and then the impact for the user in terms of wikitext syntax change.
This text is mostly adapted from my entry in RELEASE-NOTES.
== Technical viewpoint ==
The parser pass order has changed from
* Extension tag strip and render
* HTML normalisation and security
* Template expansion
* Main section...
to
* Template and extension tag parse to intermediate representation
* Template expansion and extension rendering
* HTML normalisation and security
* Main section...
The new two-pass preprocessor can skip "dead branches" in template expansion, such as unfollowed #if cases and unused defaults for template arguments. This provides a significant performance improvement in template-heavy test cases taken from Wikipedia. Parser function hooks can participate in this performance improvement by using the new SFH_OBJECT_ARGS flag during registration.
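As a rough illustration of dead branch skipping (a toy sketch in Python, not the actual MediaWiki code), the expander below works on an already-parsed tree and only descends into the branch of a conditional that is actually followed. The node classes `Text`, `Template` and `If`, and the `Expander` class, are hypothetical names made up for this example; the real preprocessor uses a DOM tree and PHP parser function hooks.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical node types for illustration only.
@dataclass
class Text:
    value: str


@dataclass
class Template:
    name: str


@dataclass
class If:
    condition: "Node"
    then_branch: "Node"
    else_branch: "Node"


Node = Union[Text, Template, If]


class Expander:
    def __init__(self, templates):
        self.templates = templates  # name -> already-parsed Node
        self.expanded = []          # names of templates actually expanded

    def expand(self, node):
        if isinstance(node, Text):
            return node.value
        if isinstance(node, Template):
            self.expanded.append(node.name)
            return self.expand(self.templates[node.name])
        if isinstance(node, If):
            # The condition has to be expanded to decide which way to go ...
            cond = self.expand(node.condition).strip()
            # ... but the branch that is not followed is never touched.
            chosen = node.then_branch if cond else node.else_branch
            return self.expand(chosen)
        raise TypeError(f"unknown node type: {type(node).__name__}")


templates = {"cheap": Text("ok"), "expensive": Text("x" * 1000)}
tree = If(Text(""), Template("expensive"), Template("cheap"))
expander = Expander(templates)
result = expander.expand(tree)
# result == "ok", and expander.expanded == ["cheap"]:
# the "expensive" template in the dead branch was never expanded.
```

The point of the sketch is that the conditional receives unexpanded subtrees rather than pre-expanded strings, which is what lets it skip work.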
The intermediate representation I have used is a DOM document tree, taking advantage of PHP's standard access to libxml's efficient tree structures. I construct the tree via an XML text stage, although it could be done directly with DOM. My gut feeling was that the XML implementation would be faster, but I've made the interfaces such that it could be done either way. The XML form is not exposed.
One reason for using an intermediate representation is so that the parse results for templates can be cached. The theory is that the cached results can then be used to efficiently expand templates with changeable arguments, such as {{cite web}}. (There's also an expansion cache for templates expanded with no arguments, such as {{•}}.)
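The caching idea can be sketched as follows. This is a minimal Python toy under stated assumptions, not the MediaWiki implementation: the `TemplateCache` class, its method names, and the simplified syntax (only `{{{name}}}` arguments, no defaults, no nesting) are all invented for the example. The template source is parsed once, and the parsed form is then reused for every expansion with different arguments.

```python
import re

# Toy syntax: {{{name}}} marks an argument; everything else is plain text.
ARG = re.compile(r"\{\{\{(\w+)\}\}\}")


class TemplateCache:
    def __init__(self, sources):
        self.sources = sources  # title -> raw template text
        self.parsed = {}        # title -> cached parse result
        self.parse_count = 0    # how many times a real parse happened

    def parse(self, title):
        if title not in self.parsed:
            self.parse_count += 1
            # With one capture group, re.split alternates:
            # text, argument name, text, argument name, ..., text
            self.parsed[title] = ARG.split(self.sources[title])
        return self.parsed[title]

    def expand(self, title, args):
        out = []
        for i, part in enumerate(self.parse(title)):
            if i % 2 == 0:
                out.append(part)  # literal text
            else:
                # Arguments that were not supplied stay as literal text.
                out.append(args.get(part, "{{{" + part + "}}}"))
        return "".join(out)


cache = TemplateCache({"cite": "[{{{url}}} {{{title}}}]"})
first = cache.expand("cite", {"url": "http://example.org", "title": "Example"})
second = cache.expand("cite", {"url": "http://example.com", "title": "Other"})
# first == "[http://example.org Example]"
# cache.parse_count == 1: the second expansion reused the cached parse.
```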
Another reason is that I couldn't see any efficient (O(N) worst-case time order) way to implement dead branch elimination without an intermediate representation.
The pre-expand include size limit has been removed, since there's no efficient way to calculate such a figure, and it would now be meaningless for performance anyway. The "preprocessor node count" takes its place, with a generous default limit.
The context in which XML-style extension tags are called has changed, so extensions which make use of the parser state may need compatibility changes. Since extension tags are now rendered simultaneously with template expansion, there is a possibility for future improvement of the extension tag interface. For example, we could have preprocessor-transparent tags which act like parser functions, and we could give extension tags access to the template arguments (i.e. triple brace expansion).
== User viewpoint ==
The main effect of this for the user is that the rules for uncovered syntax have changed, that is, syntax which only comes together once templates have been expanded.
Uncovered main-pass syntax, such as HTML tags, is now generally valid, whereas previously in some cases it was escaped. For example, you could have "<ta" in one template, and "ble>" in another template, and put them together to make a valid <table> tag. Previously the result would have been "&lt;table&gt;".
Uncovered preprocessor syntax is generally not recognised. For example, if you have "{{a" in Template:A and "b}}" in Template:B, then "{{a}}{{b}}" will be converted to a literal "{{ab}}" rather than the contents of Template:Ab. This was the case previously in HTML output mode, and is now uniformly the case in the other modes as well. HTML-style comments uncovered by template expansion will not be recognised by the preprocessor and hence will not prevent template expansion within them, but they will be stripped by the following HTML security pass.
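The "{{a}}{{b}}" case can be mimicked with a small Python toy. This is an assumption-laden sketch, not how the preprocessor is written: `expand_once` is a hypothetical helper, template names are matched with a simple regular expression, and nested templates inside template source are not handled at all. It only shows the one property described above: template calls are identified in the original text, and the expanded output is not scanned again for new calls.

```python
import re

# Toy syntax: a template call is {{name}} with a plain word as the name.
TEMPLATE = re.compile(r"\{\{(\w+)\}\}")


def expand_once(text, templates):
    """Replace each {{name}} found in the original text with its source.

    re.sub never rescans the replacement text, so braces that only line up
    after expansion are left as literal text.
    """
    def replace(match):
        name = match.group(1)
        # Unknown templates are left untouched in this toy.
        return templates.get(name, match.group(0))

    return TEMPLATE.sub(replace, text)


templates = {
    "a": "{{a",
    "b": "b}}",
    "ab": "contents of Template:Ab",
}
output = expand_once("{{a}}{{b}}", templates)
# output == "{{ab}}", a literal string, not the contents of Template:Ab
```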
The rules for template expansion during message transformation were counterintuitive, mostly accidental and buggy. There are a few small changes in this version: for example, templates with dynamic names, as in "{{ {{a}} }}", are fully expanded as they are in HTML mode, whereas previously only the inner template was expanded. I'd like to make some larger breaking changes to message transformation, after a review of typical use cases.
The header identification routines for section edit and for numbering section edit links have been merged. This removes a significant failure mode and fixes a whole category of bugs (tracked by bug #4899). Wikitext headings uncovered by template expansion or comment removal will still be rendered into a heading tag, and will get an entry in the TOC, but will not have a section edit link. HTML-style headings will also not have a section edit link. Valid wikitext headings present in the template source text will get a template section edit link. This is a major break from previous behaviour, but I believe the effects are almost entirely beneficial.
-- Tim Starling