[Wikitext-l] Fwd: [Wikitech-l] New preprocessor

Wed Nov 21 02:37:53 UTC 2007

---------- Forwarded message ----------
From: Tim Starling <tstarling at wikimedia.org>
Date: 21 Nov 2007 02:34
Subject: [Wikitech-l] New preprocessor
To: wikitech-l at lists.wikimedia.org

Brion said to me a couple of weeks ago "the parser is slow for large
articles, fix it". So along these lines, I have rewritten the preprocessor
phase to make it faster in PHP. I also have plans for further speed
improvement via a partial port to C.

This work was planned and started before the recent parser discussions on
wikitech-l, by Steve Bennett et al. I chose to ignore those discussions to
improve my productivity. Apologies if I'm stepping on any toes.

I'll cover the technical side of this first, and then the impact for the
user in terms of wikitext syntax change.

This text is mostly adapted from my entry in RELEASE-NOTES.

== Technical viewpoint ==

The parser pass order has changed from

    * Extension tag strip and render
    * HTML normalisation and security
    * Template expansion
    * Main section...

to

    * Template and extension tag parse to intermediate representation
    * Template expansion and extension rendering
    * HTML normalisation and security
    * Main section...

The new two-pass preprocessor can skip "dead branches" in template
expansion, such as unfollowed #if cases and unused defaults for template
arguments. This provides a significant performance improvement in
template-heavy test cases taken from Wikipedia. Parser function hooks can
participate in this performance improvement by using the new
SFH_OBJECT_ARGS flag during registration.
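
As a rough illustration of dead branch elimination (a sketch only, not the
actual MediaWiki code), consider a tree in which the arguments of "#if" are
kept as unexpanded subtrees. The node classes, the expand() function and the
"expansions" log below are hypothetical names invented for this example. Only
the condition and the chosen branch are ever expanded:

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    """Literal text node (hypothetical, for illustration only)."""
    value: str


@dataclass
class Call:
    """A call whose arguments are still unexpanded subtrees."""
    name: str
    args: List["Node"]


Node = Union[Text, Call]

# Records which non-#if calls were actually expanded.
expansions: List[str] = []


def expand(node: Node) -> str:
    if isinstance(node, Text):
        return node.value
    if node.name == "#if":
        cond, then_branch, else_branch = node.args
        # Expand the condition, then only the branch that is followed.
        chosen = then_branch if expand(cond).strip() else else_branch
        return expand(chosen)
    # Any other call stands in for a potentially expensive expansion.
    expansions.append(node.name)
    return "<" + node.name + ">"


tree = Call("#if", [Text("yes"), Call("cheap", []), Call("expensive", [])])
result = expand(tree)
print(result, expansions)
```

Here the "expensive" subtree is never visited, because it sits in the branch
that was not followed. A function that received already-expanded argument
strings could not skip it, which is presumably why hooks need to opt in to
receiving unexpanded arguments.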

The intermediate representation I have used is a DOM document tree, taking
advantage of PHP's standard access to libxml's efficient tree structures.
I construct the tree via an XML text stage, although it could be done
directly with DOM. My gut feeling was that the XML implementation would be
faster, but I've made the interfaces such that it could be done either
way. The XML form is not exposed.

One reason for using an intermediate representation is so that the parse
results for templates can be cached. The theory is that the cached results
can then be used to efficiently expand templates with changeable
arguments, such as {{cite web}}. (There's also an expansion cache for
templates expanded with no arguments, such as {{•}}.)
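
The caching idea can be sketched as follows. This is a simplified model under
stated assumptions, not the real implementation: the "tree" is just a flat
list of text and argument nodes, only triple-brace arguments are recognised,
and the names parse(), get_tree(), expand() and parse_count are made up for
the example. The point is that the template source is parsed once and the
cached result is reused for each new set of arguments:

```python
import re
from typing import Dict, List, Tuple

# Matches a simple triple-brace argument such as {{{url}}}.
ARG = re.compile(r"\{\{\{(\w+)\}\}\}")

parse_count = 0  # how many times a template source was actually parsed
_tree_cache: Dict[str, List[Tuple[str, str]]] = {}


def parse(source: str) -> List[Tuple[str, str]]:
    """Split template source into ("text", s) and ("arg", name) nodes."""
    global parse_count
    parse_count += 1
    nodes: List[Tuple[str, str]] = []
    pos = 0
    for m in ARG.finditer(source):
        if m.start() > pos:
            nodes.append(("text", source[pos:m.start()]))
        nodes.append(("arg", m.group(1)))
        pos = m.end()
    if pos < len(source):
        nodes.append(("text", source[pos:]))
    return nodes


def get_tree(title: str, source: str) -> List[Tuple[str, str]]:
    """Return the cached parse result for a template, parsing on first use."""
    if title not in _tree_cache:
        _tree_cache[title] = parse(source)
    return _tree_cache[title]


def expand(title: str, source: str, args: Dict[str, str]) -> str:
    """Expand a template with the given arguments using the cached tree."""
    out = []
    for kind, value in get_tree(title, source):
        if kind == "text":
            out.append(value)
        else:
            # Missing arguments are left as literal triple-brace text.
            out.append(args.get(value, "{{{" + value + "}}}"))
    return "".join(out)


src = "[{{{url}}} {{{title}}}]"
first = expand("cite web", src, {"url": "http://example.org", "title": "One"})
second = expand("cite web", src, {"url": "http://example.com", "title": "Two"})
print(first, second, parse_count)
```

Both calls share one parse of the template source; only the cheap
substitution step is repeated per call.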

Another reason is that I couldn't see any efficient (O(N) worst-case time
order) way to implement dead branch elimination without an intermediate
representation.

The pre-expand include size limit has been removed, since there's no
efficient way to calculate such a figure, and it would now be meaningless
for performance anyway. The "preprocessor node count" takes its place,
with a generous default limit.

The context in which XML-style extension tags are called has changed, so
extensions which make use of the parser state may need compatibility
changes. Since extension tags are now rendered simultaneously with
template expansion, there is a possibility for future improvement of the
extension tag interface. For example, we could have
preprocessor-transparent tags which act like parser functions, and we
could give extension tags access to the template arguments (i.e. triple
brace expansion).

== User viewpoint ==

The main effect of this for the user is that the rules for uncovered
syntax have changed.

Uncovered main-pass syntax, such as HTML tags, is now generally valid,
whereas previously in some cases it was escaped. For example, you could
have "<ta" in one template, and "ble>" in another template, and put them
together to make a valid <table> tag. Previously the result would have
been "&lt;table&gt;".

Uncovered preprocessor syntax is generally not recognised. For example, if
you have "{{a" in Template:A and "b}}" in Template:B, then "{{a}}{{b}}"
will be converted to a literal "{{ab}}" rather than the contents of
Template:Ab. This was the case previously in HTML output mode, and is now
uniformly the case in the other modes as well. HTML-style comments
uncovered by template expansion will not be recognised by the preprocessor
and hence will not prevent template expansion within them, but they will
be stripped by the following HTML security pass.
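
The "{{a}}{{b}}" case can be modelled with a small sketch. It is only an
illustration under simplifying assumptions: the TEMPLATES table, the CALL
pattern and the expand() function are hypothetical, and only bare
"{{name}}" calls are handled. Each template body is parsed on its own, and
the combined output is never scanned again for preprocessor syntax:

```python
import re

# Hypothetical template store for the example in the text.
TEMPLATES = {
    "a": "{{a",
    "b": "b}}",
    "ab": "contents of Template:Ab",
}

# Matches only a complete, simple call such as {{a}}.
CALL = re.compile(r"\{\{(\w+)\}\}")


def expand(source: str, depth: int = 0) -> str:
    """Expand complete {{name}} calls found in `source`.

    Each template body is expanded separately; text produced by
    expansion is not re-parsed together with its neighbours.
    """
    if depth > 10:  # crude guard against template loops
        return source

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in TEMPLATES:
            return match.group(0)  # unknown template: leave as written
        return expand(TEMPLATES[name], depth + 1)

    # re.sub does not rescan the replacement text for further matches.
    return CALL.sub(replace, source)


uncovered = expand("{{a}}{{b}}")
direct = expand("{{ab}}")
print(uncovered)
print(direct)
```

In this model the braces that only become adjacent after expansion stay
literal, while a call that is complete in the source text is expanded as
usual.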

The rules for template expansion during message transformation were
counterintuitive, mostly accidental and buggy. There are a few small
changes in this version: for example, templates with dynamic names, as in
"{{ {{a}} }}", are fully expanded as they are in HTML mode, whereas
previously only the inner template was expanded. I'd like to make some
larger breaking changes to message transformation, after a review of
typical use cases.

The header identification routines for section edit and for numbering
section edit links have been merged. This removes a significant failure
mode and fixes a whole category of bugs (tracked by bug #4899). Wikitext
headings uncovered by template expansion or comment removal will still be
rendered into a heading tag, and will get an entry in the TOC, but will
not have a section edit link. HTML-style headings will also not have a
section edit link. Valid wikitext headings present in the template source
text will get a template section edit link. This is a major break from
previous behaviour, but I believe the effects are almost entirely beneficial.

-- Tim Starling
