Wikimedia developers wikitech-l@Wikipedia.org writes:
Nick Pisarro wrote:
The current parser, which performs dozens of passes, probably degrades by the square of the file size.
Really? All the regular expressions I've seen should be possible in O(N) time. There are no PHP loops that loop through every character, only through certain kinds of entities such as every link. I would have thought that 14 passes at O(N) still produces O(N). Oh well, I'm not a computer scientist, what would I know.
I had to think about this for a few days:
Doing the scan is linear; putting together the output string, however, is not.
The time for a single string concatenation can be expected to grow linearly with the total length of the strings being concatenated. The number of concatenations can be expected to grow linearly with the length of the article, assuming a roughly constant amount of markup per byte. So the time to put together the wikified output grows as the product of the two.
So, the time to put together an article is going to be: TotalTime = ScanTimePerByte * length + AssemblyTimePerByte * length ^ 2.
So, a 30,000-byte article is going to take 10,000 times as long to assemble as a 300-byte article. Depending on the values of 'ScanTimePerByte' and 'AssemblyTimePerByte' in the above formula, at some point the assembly time is going to swamp the scan time.
So, the question becomes: how does one put together the output string in a way that grows linearly with the length rather than with the square of the length? Doing "append" operations rather than concatenations should grow at a rate that is fairly linear, but there is no string append in PHP. You can append to an array, however.
The idea is that instead of gluing the output string together as one goes, each piece of output is appended as an individual element of an array. When one is finished, the PHP "implode" function, with an empty 'glue' string, does the final assembly.
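To make the difference concrete, here is a minimal benchmark sketch. It is not taken from the MediaWiki code; the fragment size and iteration count are arbitrary, and the exact numbers will depend on the PHP version and build.

    <?php
    // Build the same output two ways and compare the time taken.
    function now() {
        // microtime() returns "usec sec"; combine into a float (works on PHP 4).
        list( $usec, $sec ) = explode( ' ', microtime() );
        return (float)$sec + (float)$usec;
    }

    $pieces   = 50000;                  // number of output fragments
    $fragment = str_repeat( 'x', 40 );  // a typical small piece of wikified text

    // Method 1: repeated ".=" concatenation.
    $t0 = now();
    $s = '';
    for ( $i = 0; $i < $pieces; $i++ ) {
        $s .= $fragment;
    }
    echo "concatenate: " . ( now() - $t0 ) . " sec\n";

    // Method 2: append to an array, implode once at the end.
    $t0 = now();
    $a = array();
    for ( $i = 0; $i < $pieces; $i++ ) {
        $a[] = $fragment;
    }
    $s2 = implode( '', $a );
    echo "implode:     " . ( now() - $t0 ) . " sec\n";

If concatenation really is quadratic, the gap between the two times should widen as $pieces grows.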
If the "implode" operations is well written, it may not be, before it starts, it should add up the total number of bytes it will need for the result string. Then it should allocate the space required in one operation and assemble the final string by moving everything in at once.
The current development project I am doing in my "day" job, is being done done in Visual Basic .Net--upgraded from VB 6. (No comments please). Our product was choking when assembling certain .html files. It turns out that the problem was found to be assembling an output buffer by doing up to 25,000 byte by byte concatenates. By using a new object in VB .Net called a "StringBuilder", which allows doing appends to a buffer, the process time for this operation dropped in insignificance.
An example of my suggested idea is optimizing such an operation in the new Parser object. The parser turns the string to be wikified into tokens in one fell swoop, an excellent optimization, but then builds the result by doing "$s .= $txt;" operations repeatedly. Instead, these should be replaced by "$sIntermediate[] = $txt;".
When finished, 'replaceInternalLinks' can do a 'return implode("", $sIntermediate);'.
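As a sketch of what I mean (the real 'replaceInternalLinks' does far more work per token; the tokenization shown here is made up purely for illustration):

    function replaceInternalLinks( $text ) {
        $tokens = explode( '[[', $text );   // stand-in for the real tokenizer
        $sIntermediate = array();
        foreach ( $tokens as $token ) {
            $txt = $token;                  // stand-in for the real per-token wikifying
            // Old way (copies the whole result so far on every markup):
            //     $s .= $txt;
            // Proposed way (cheap append):
            $sIntermediate[] = $txt;
        }
        // One final pass assembles the whole output string.
        return implode( '', $sIntermediate );
    }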
There are other places in the parser where similar operations exist. Certainly, since Internet protocols can't be nested, the function 'replaceExternalLinks' could be rewritten to assemble an output array and then implode it, rather than doing many square-law Replace operations using the $unique strings.
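Something along these lines, say; this is only a sketch with a deliberately simplified link syntax, not the real 'replaceExternalLinks' logic or the real $unique mechanism:

    function replaceExternalLinksSketch( $text ) {
        // Split on a simplified [http://url label] syntax, keeping the captures.
        $bits = preg_split( '/\[(http:\/\/[^ \]]+) ([^\]]+)\]/',
                            $text, -1, PREG_SPLIT_DELIM_CAPTURE );
        $out   = array();
        $out[] = array_shift( $bits );       // text before the first link
        while ( count( $bits ) >= 3 ) {
            $url   = array_shift( $bits );
            $label = array_shift( $bits );
            $out[] = '<a href="' . htmlspecialchars( $url ) . '">'
                   . htmlspecialchars( $label ) . '</a>';
            $out[] = array_shift( $bits );   // plain text up to the next link
        }
        // No placeholder-substitution passes over the whole article; just one implode.
        return implode( '', $out );
    }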
Another place where $unique is used, a horrible kludge (sorry to the author), is in handling <nowiki>, <pre>, and <math> tags. Since markup processed by 'doWikiPass2' cannot cross <nowiki>... tags [is this allowed, or if so, desirable?], the handling of those tags could split the text into individual strings in an array, run 'doWikiPass2' on only the appropriate pieces, and assemble the final product with an implode.
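For instance, something like the following, assuming <nowiki> sections never nest; 'doWikiPass2' is called as a plain function here only for the sake of the sketch, and <pre> and <math> would need their own, similar handling:

    function wikifyOutsideNowiki( $text ) {
        // Split into alternating pieces: ordinary wikitext at even indexes,
        // the literal <nowiki> contents at odd indexes.
        $pieces = preg_split( '/<nowiki>(.*?)<\/nowiki>/s',
                              $text, -1, PREG_SPLIT_DELIM_CAPTURE );
        $out = array();
        foreach ( $pieces as $i => $piece ) {
            if ( $i % 2 == 0 ) {
                $out[] = doWikiPass2( $piece );   // ordinary wikitext: process it
            } else {
                $out[] = $piece;                  // <nowiki> contents: leave untouched
            }
        }
        return implode( '', $out );
    }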
There may be similar optimizations in table processing, which I have not looked at.
I have to go do my "day" job now, but I'll try out some of these things the next time I get a chance, unless someone gets to this before I do. Please let me know if one of you starts trying these ideas out before I can get to it.
Nick Pisarro