Hello,
Template expansion performance in Parsoid has improved considerably of late. [[:en:Barack Obama]] now expands in reasonable time and memory:
echo '{{:Barack Obama}}' | time node parse.js > obama.html
42.18user 2.04system 0:46.23elapsed 95%CPU (0avgtext+0avgdata 1528480maxresident)k
848inputs+0outputs (8major+691968minor)pagefaults 0swaps
Sample output: http://dev.wikidev.net/gabriel/tmp/obama.html
The time utility reports the maximum resident memory inflated by a factor of four (a bug in time, afaik), so this works out to about 382M maximum resident, which matches what top shows. This was measured on my Intel i3 M370 laptop, and is in the same ballpark as the PHP parser running on the cluster.
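The memory correction above is simple arithmetic. As a minimal sketch, assuming the factor-of-four inflation described here applies to the time build in use, the reported figure can be corrected like this (the function name and constant are illustrative, not part of any tool):

```python
import re

# Assumption taken from the text: this build of time over-reports
# maxresident by a factor of four.
INFLATION_FACTOR = 4

def corrected_max_resident_kb(time_output: str) -> int:
    """Extract the 'NNNmaxresident' field and undo the assumed inflation."""
    match = re.search(r"(\d+)maxresident", time_output)
    if match is None:
        raise ValueError("no maxresident field found")
    return int(match.group(1)) // INFLATION_FACTOR

sample = ("42.18user 2.04system 0:46.23elapsed 95%CPU "
          "(0avgtext+0avgdata 1528480maxresident)k")
kb = corrected_max_resident_kb(sample)
print(kb)  # 382120
```

That is 382120 kB, or roughly 382M when counting 1000 kB per M.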
A few tokenizer issues are visible in the output. Some of them are caused by templates, such as the Navbox templates near the bottom, that contain constructs like this:
{{#if:foo|<tr><th><th style="foo;|<th style="}} bar;">
Supporting this would be very messy, so the template should be changed to use a saner nesting:
{{#if:foo|<tr><th>|}}<th style="{{#if:foo|foo;|}} bar">
We'll have to analyze how common this kind of mis-nesting is, and whether we can perhaps fix it up automatically. See also http://www.mediawiki.org/wiki/Parsoid/Todo#Limitations.
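A rough way to estimate how common this mis-nesting is would be to scan template source for arguments that end in the middle of an HTML tag or a quoted attribute. The following is only a heuristic sketch, not how Parsoid tokenizes anything. The function names are hypothetical, it looks only at innermost {{...}} spans, and it splits arguments naively on '|'.

```python
import re
from typing import List

# Matches only innermost {{...}} spans (those containing no further braces).
INNERMOST_TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")

def has_open_markup(argument: str) -> bool:
    """True if the argument leaves an HTML tag or a quoted attribute open."""
    unclosed_tag = argument.count("<") != argument.count(">")
    unclosed_quote = argument.count('"') % 2 == 1
    return unclosed_tag or unclosed_quote

def find_misnested_arguments(wikitext: str) -> List[str]:
    """Return template arguments that stop mid-tag or mid-attribute."""
    suspects = []
    for match in INNERMOST_TEMPLATE.finditer(wikitext):
        # Naive split: a '|' inside [[links]] is treated as a separator too.
        _name, *arguments = match.group(1).split("|")
        suspects.extend(arg for arg in arguments if has_open_markup(arg))
    return suspects

bad = '{{#if:foo|<tr><th><th style="foo;|<th style="}} bar;">'
good = '{{#if:foo|<tr><th>|}}<th style="{{#if:foo|foo;|}} bar">'
print(find_misnested_arguments(bad))
print(find_misnested_arguments(good))
```

On the two Navbox-style examples above, this flags both arguments of the first form and none of the second. Counting brackets and quotes is crude, so a real survey would need to confirm hits against actual tokenizer output.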
The next bigger tasks I intend to tackle:
* Add full round-trip information, and preserve it through transformations and DOM tree creation
* Call back to action=parse to retrieve information we need from the wiki (link existence, image dimensions, many information-based parser functions and magic words, extensions)
* Create a [ DOM -> token -> WikiText ] serializer chain based on the existing serializer in the Visual Editor
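For the action=parse callback in the list above, a request to the MediaWiki API might be assembled roughly as follows. This is a sketch under assumptions: the endpoint, the helper name and the choice of prop values are illustrative, and which properties actually carry the needed information is left open. Nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed target wiki; any MediaWiki api.php endpoint would do.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_parse_request(wikitext: str, title: str) -> str:
    """Build an action=parse URL for a wikitext fragment (hypothetical helper)."""
    params = {
        "action": "parse",
        "format": "json",
        "title": title,          # page context used while parsing
        "text": wikitext,        # fragment to hand to the PHP parser
        "prop": "links|images",  # illustrative selection of properties
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = build_parse_request("[[Foo]] [[File:Bar.png]]", "Sandbox")
query = parse_qs(urlsplit(url).query)
print(query["action"], query["prop"])
```

Batching several fragments into one request would probably matter for performance, but that is beyond this sketch.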
Gabriel
wikitext-l@lists.wikimedia.org