Hello,
template expansion performance in Parsoid has improved considerably
recently. [[:en:Barack Obama]] now expands in reasonable time and
memory:
echo '{{:Barack Obama}}' | time node parse.js > obama.html
42.18user 2.04system 0:46.23elapsed 95%CPU (0avgtext+0avgdata
1528480maxresident)k
848inputs+0outputs (8major+691968minor)pagefaults 0swaps
Sample output:
http://dev.wikidev.net/gabriel/tmp/obama.html
The time utility reports the maximum resident memory inflated by a
factor of four (a bug in time, afaik), so this works out to about 382M
maximum resident, which matches what top shows. This was measured on my
Intel i3 M370 laptop, and is in the same ballpark as the PHP parser
running on the cluster.
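For reference, the conversion above can be written out as a small
sketch. The function name is made up for illustration, and the factor
of four is simply the inflation assumed above, not something the code
verifies:

```javascript
// time reports maxresident in kilobytes. The factor-of-four inflation
// is the assumption stated in the text above.
const INFLATION_FACTOR = 4;

// Hypothetical helper: divide out the inflation, then convert
// kilobytes to megabytes (decimal, 1000 kB per MB).
function correctedMaxResidentMB(reportedKB, factor = INFLATION_FACTOR) {
  return reportedKB / factor / 1000;
}

// 1528480 kB as reported by time for the Barack Obama run
console.log(correctedMaxResidentMB(1528480).toFixed(0) + 'M'); // prints "382M"
```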
There are a few tokenizer issues visible in the output. Some of them
are caused by templates, such as the Navbox templates near the bottom,
that contain constructs like this:
{{#if:foo|<tr><th><th style="foo;|<th style="}} bar;">
Supporting this would be very messy, so the template should be changed
to use a saner nesting:
{{#if:foo|<tr><th>|}}<th style="{{#if:foo|foo;|}} bar">
We'll have to analyze how common this kind of mis-nesting is, and
whether we can fix it up automatically. See also
http://www.mediawiki.org/wiki/Parsoid/Todo#Limitations.
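One possible starting point for such an analysis is a heuristic that
flags parser function branches which end in the middle of an HTML tag.
The sketch below is only that: endsInsideTag is a hypothetical helper,
not part of Parsoid, and it assumes the branch text has already been
split out of the {{#if:...}} call. A bare '<' in running text would
produce a false positive.

```javascript
// Hypothetical helper: returns true if `text` ends in the middle of an
// HTML tag, i.e. after a '<' that has no matching '>'. Attribute quotes
// are tracked so that a '>' inside a quoted value does not close the tag.
function endsInsideTag(text) {
  let inTag = false;
  let quote = null; // the currently open attribute quote character, if any
  for (const ch of text) {
    if (!inTag) {
      if (ch === '<') inTag = true;
    } else if (quote) {
      if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === '>') {
      inTag = false;
    }
  }
  return inTag;
}

// The two branches of the mis-nested example both end inside a tag:
console.log(endsInsideTag('<tr><th><th style="foo;')); // true
console.log(endsInsideTag('<th style="'));             // true
// The first branch of the saner nesting does not:
console.log(endsInsideTag('<tr><th>'));                // false
```

Running something like this over the branches of all parser function
calls in a dump would give a rough count of affected templates.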
The next bigger tasks I intend to tackle:
* Add full round-trip information, and preserve it through
transformations and DOM tree creation
* Call back to action=parse to retrieve information we need from the
wiki (link existence, image dimensions, many information-based parser
functions and magic words, extensions).
* Create a [ DOM -> token -> WikiText ] serializer chain based on the
existing serializer in the Visual Editor
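
To illustrate the action=parse callback from the list above, here is a
minimal sketch of how such a request URL could be built. The helper
name buildParseUrl, the choice of prop=text, and the en.wikipedia.org
endpoint are assumptions for illustration. How Parsoid will actually
issue and batch these requests is not decided here.

```javascript
// Hypothetical helper: build a MediaWiki API URL that asks the wiki to
// parse a wikitext fragment in the context of a given page title.
function buildParseUrl(apiBase, wikitext, title) {
  const params = new URLSearchParams({
    action: 'parse',
    format: 'json',
    prop: 'text',   // assumption: only the rendered HTML is requested
    title: title,   // page context for magic words such as {{PAGENAME}}
    text: wikitext, // the fragment the wiki should expand
  });
  return apiBase + '?' + params.toString();
}

const url = buildParseUrl(
  'https://en.wikipedia.org/w/api.php', // assumed endpoint
  '{{PAGENAME}}',
  'Barack Obama'
);
console.log(url);
```

URLSearchParams takes care of percent-encoding the braces and spaces,
so the wikitext can be passed through unmodified.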
Gabriel