Wikimedia developers <wikitech-l(a)Wikipedia.org> writes:
> Nick Pisarro wrote:
>> The current parser, which performs dozens of passes, probably degrades
>> by the square of the file size.
>
> Really? All the regular expressions I've seen should be possible in O(N)
> time. There are no PHP loops which loop through every character, just
> through certain kinds of entities such as every link. I would have
> thought that 14 passes at O(N) still produces O(N). Oh well, I'm not a
> computer scientist, what would I know.
I had to think about this for a few days: doing the scan is linear;
however, putting together the output string is not.
The time for a single string-concatenation operation can be expected to
grow linearly with the total length of the strings being concatenated. The
number of concatenations can be expected to grow linearly with the length
of the article, assuming a roughly constant rate of markups per byte. So
the time to put together the wikified output is going to grow as the
product of the two.
So, the time to put together an article is going to be:

    TotalTime = ScanTimePerByte * length + AssemblyTimePerByte * length^2
So, a 30,000-byte article is going to take 10,000 times as long to
assemble as a 300-byte article. Depending on the values of
'ScanTimePerByte' and 'AssemblyTimePerByte' in the above formula, at some
point the assembly time is going to swamp the scan time.
So, the question becomes: how does one put together the output string in a
way that grows linearly with the length rather than with the square of the
length? "Append" operations, rather than concatenations, should grow at a
rate that is fairly linear. But PHP has no string append; you can append
to an array, however.
The idea is that instead of gluing the output string together as one goes,
each piece of output is appended as an individual element of an array.
When one is finished, the PHP "implode" function, with an empty 'glue'
argument, does the final assembly.
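A minimal sketch of that pattern (the $tokens data and variable names here
are invented for illustration, not taken from the parser):

```php
<?php
// Collect output fragments in an array, then join them once at the
// end, instead of growing a string with repeated concatenation.
$tokens = array('Hello', ',', ' ', 'world');
$pieces = array();
foreach ($tokens as $t) {
    $pieces[] = $t;               // array append: cheap per element
}
$output = implode('', $pieces);   // one final assembly pass
echo $output . "\n";
```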
If the "implode" operation is well written (it may not be), then before it
starts it should add up the total number of bytes needed for the result
string, allocate the required space in one operation, and assemble the
final string by moving everything in at once.
The current development project in my "day" job is being done in Visual
Basic .NET, upgraded from VB 6 (no comments, please). Our product was
choking when assembling certain .html files. The problem turned out to be
an output buffer assembled by doing up to 25,000 byte-by-byte
concatenations. By switching to a new object in VB .NET called a
"StringBuilder", which allows appending to a buffer, the time for this
operation dropped to insignificance.
An example of where this idea could be applied is the new Parser object.
The parser turns the string to be wikified into tokens in one fell swoop,
an excellent optimization, but then builds the result by doing
"$s .= $txt;" operations repeatedly. These should instead be replaced by
"$sIntermediate[] = $txt;". When finished, 'replaceInternalLinks' can do a
'return implode("", $sIntermediate);'.
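To make the substitution concrete, here is the same loop written both
ways; this is a hedged illustration with made-up $fragments, not the
actual Parser code:

```php
<?php
$fragments = array('[[Foo]]', ' and ', '[[Bar]]');

// Current style: repeated concatenation; each ".=" may have to copy
// the whole buffer built so far.
$s = '';
foreach ($fragments as $txt) {
    $s .= $txt;
}

// Suggested style: append to an array, implode once at the end.
$sIntermediate = array();
foreach ($fragments as $txt) {
    $sIntermediate[] = $txt;
}
$s2 = implode('', $sIntermediate);

assert($s === $s2);   // identical output, different cost profile
```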
There are other places in the parser where similar operations exist.
Certainly, since Internet protocols can't be nested, the function
'replaceExternalLinks' could be rewritten to assemble an output array and
then implode it, rather than doing many square-law replace operations
using the $unique strings.
Another place where $unique is used, a horrible kludge (sorry to the
author), is the handling of <nowiki>, <pre>, and <math> tags. Since tags
processed by 'doWikiPass2' cannot cross <nowiki>...</nowiki> boundaries
[is this allowed, or if so, desirable?], the processing of those tags
could split the text into individual strings in an array, run
'doWikiPass2' on only the appropriate pieces, and assemble the final
product with an implode.
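A hypothetical sketch of that split-and-implode approach, with a trivial
link replacement standing in for the real 'doWikiPass2' (the pattern,
sample text, and stand-in are all my own invention):

```php
<?php
// Split on <nowiki>...</nowiki>, keeping the delimiters as segments,
// process only the segments outside the tags, then implode.
$text  = 'a [[link]] <nowiki>[[not a link]]</nowiki> more text';
$parts = preg_split('!(<nowiki>.*?</nowiki>)!s', $text, -1,
                    PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => $p) {
    if (strpos($p, '<nowiki>') !== 0) {
        // stand-in for doWikiPass2() on this piece only
        $parts[$i] = preg_replace('/\[\[(.*?)\]\]/', '<a>$1</a>', $p);
    }
}
$result = implode('', $parts);
echo $result . "\n";
```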
There may be similar optimizations in table processing, which I have not
looked at.
I have to go do my "day" job now, but I'll try out some of these things
the next time I get a chance, unless someone gets to them before I do.
Please let me know if one of you starts trying these ideas out before I
can get to it.
Nick Pisarro