Jens Frank wrote:
> You should perhaps have a look at 1.3 first. Parts of the Parser are
> already a real parser, reading the wikitext in one pass, character by
> character. See Tokenizer.php and its use in Parser.php. This work is
> not yet completed, so the regexes still exist for some parts of the
> markup.
I hadn't seen this bit of the parser. Last time I looked at it, it was still
splitting the string using regexes. When I saw the way you do it currently,
I have to admit I went into a bit of a panic. In my experience, reading a
large string character by character in a high-level language is a very bad
idea. Indeed, our "parser" to date has gone to some lengths to avoid this,
using regexes in all sorts of contrived ways to avoid executing a number of
interpreted PHP statements proportional to the number of characters in the
input.
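To make the cost difference concrete, here is a rough sketch (illustrative
code only, not taken from MediaWiki; the function names are made up) of two
approaches to the same job: finding the offset of the first character that
could start markup. The first executes one interpreted statement per
character; the second pushes the whole scan into PCRE's C engine:

    <?php
    // Character-by-character: the interpreter runs the loop body for
    // every character, so cost grows linearly in PHP opcodes executed.
    function scanCharByChar( $text ) {
        $len = strlen( $text );
        for ( $i = 0; $i < $len; $i++ ) {
            $c = $text[$i];
            if ( $c === '[' || $c === '{' || $c === "'" ) {
                break; // possible start of wiki markup
            }
        }
        return $i;
    }

    // Regex: the same scan runs inside the C-level PCRE engine, so PHP
    // itself executes a constant number of statements per call.
    function scanWithRegex( $text ) {
        if ( preg_match( '/[\[{\']/', $text, $m, PREG_OFFSET_CAPTURE ) ) {
            return $m[0][1]; // offset of the first markup character
        }
        return strlen( $text );
    }

For this particular scan, strcspn( $text, "[{'" ) would do the same thing
without a regex at all; the general point is just that the per-character
work has to happen in C, not in interpreted PHP.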
After I calmed down, I fixed up the profiler and did a couple of runs.
Gabriel Wicke did some too, using ab. They're at:
http://meta.wikipedia.org/wiki/Profiling
They show that the page view time for the current CVS HEAD is double what it
was in 1.2.5. The parser itself was roughly 2.4 times slower.
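For anyone wanting to reproduce the numbers: the parser-level figures come
from wall-clock timing around the parse step, and the page view figures from
whole-request benchmarking with ab (e.g. ab -n 100 -c 10 <page URL>). A
minimal timing harness, assuming the usual Parser::parse() entry point
(hypothetical glue code, not the MediaWiki profiler itself), looks like:

    <?php
    // Time the parse step alone, so 1.2.5 and HEAD can be compared
    // directly, independent of skin rendering and HTTP overhead.
    $start = microtime( true );
    $output = $wgParser->parse( $wikitext, $wgTitle, $parserOptions );
    $elapsed = microtime( true ) - $start;
    printf( "parse time: %.1f ms\n", $elapsed * 1000 );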
A slowdown like that is completely unacceptable given the current state of
our web serving hardware. The latest batch of 1U servers won't cover the
penalty of upgrading to 1.3, and our web servers are not keeping up with
demand as it is: during peak times their queues all overflow, giving users
random error messages.
The current plan is to revert the tokenizer sections of the parser to
something similar to 1.2. Hopefully we'll get it working soon, since the
Board vote feature I've written is a 1.3 extension and voting is meant to
start in 4 days.
-- Tim Starling