I agree with Brion, and I'm sorry I replied to Timwi before reading Brion's "I forbid" post.
I've done profiling on Visual Basic 4.0 & 5.0, MS SQL Server 6.5, and Java; and let me tell you I'm frequently amazed at the kinds of things that cause slow spots. "But that can't be true!" is the usual response of my colleagues. Then, you show them the profiling data, and they're like, "Well, I'll be durned. Who would have known?"
Anyone know how to profile MySQL code?
Ed Poor
On Thursday, 26 February 2004 at 17:17, Poor, Edmund W wrote:
I agree with Brion, and I'm sorry I replied to Timwi before reading Brion's "I forbid" post.
I've done profiling on Visual Basic 4.0 & 5.0, MS SQL Server 6.5, and Java; and let me tell you I'm frequently amazed at the kinds of things that cause slow spots. "But that can't be true!" is the usual response of my colleagues. Then, you show them the profiling data, and they're like, "Well, I'll be durned. Who would have known?"
Anyone know how to profile MySQL code?
You should keep in mind that profiling does not always show you the code which is slow or should be improved! Often bad code occurs in different places.
For example, one function, routine, method or whatever calls another very often. The profiler will show you the frequently called function. But the best fix is not to improve the called function; it is to optimize the calling function so that it does not need to call the other one so often.
Thus you always have to understand your code, even if you profile it. And if you really understand your code, you seldom need to profile it to know where to improve it.
--Ivo Köthnig
Poor, Edmund W wrote:
Anyone know how to profile MySQL code?
Considering the amount of time Snok and I have spent writing and using MediaWiki's profiling code, and considering the amount of time Brion has spent on database optimisation (to great effect), you'll forgive me for being slightly offended.
We've profiled, we've cached, we've optimised. We've discovered that CPU load is generally the bottleneck when viewing pages, and hence spent time on parser optimisation and several kinds of caching. We have 4 web servers and one database server and still the web servers are heavily loaded.
Timwi wrote:
My guess is that the slowest part of it is checking whether a page exists, and if it does, checking its size (if the user has set the preference that shows stubs in a different colour), because both of these require a database query.
What, even with the linkscc cache and the memcached link cache? If you say so.
Nick Pisarro wrote:
The current parser, which performs dozens of passes, probably degrades by the square of the file size.
Really? All the regular expressions I've seen should be possible in O(N) time. There are no PHP loops which loop through every character, just through certain kinds of entities such as every link. I would have thought that 14 passes at O(N) still produces O(N). Oh well, I'm not a computer scientist, what would I know.
Storing diffs in the 'old' table. This would not affect performance, except when loading or comparing old revisions, but could drastically reduce the size of the database, which would surely make it more manageable. There already is a differencing engine in the source, though I'm not sure how reliable it is--it may also degrade by the square of the file difference. Here too, a sequence of diffs can be merged in one pass. Having written such code in the past, I plan to create a write-up exploring this idea. Has this idea been discussed amongst the developers? What are the gotchas?
See http://meta.wikipedia.org/wiki/History_compression
Diffs have not been extensively speed tested, but seemed to be reasonably fast when I was running my space tests. They produce good compression in the talk page or incremental improvement situation, but as expected, perform poorly in the edit war situation. As noted by Rob Hooft in this post:
http://mail.wikipedia.org/pipermail/wikitech-l/2004-February/008385.html
it should be possible to improve compression in this case by trying diffs with a few different revisions. As long as the time required per diff is in the tens of milliseconds or less, this should be feasible.
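Roughly what I have in mind, as a sketch only (makeDiff() below is just a stand-in for whatever diff routine we actually end up using):

	// Illustrative only: diff the new text against a few candidate base
	// revisions and keep whichever base gives the smallest diff.
	function pickBestBase( $newText, $candidateRevisions ) {
		$bestId = null;
		$bestDiff = null;
		foreach ( $candidateRevisions as $id => $oldText ) {
			$diff = makeDiff( $oldText, $newText ); // stand-in diff routine
			if ( $bestDiff === null || strlen( $diff ) < strlen( $bestDiff ) ) {
				$bestId = $id;
				$bestDiff = $diff;
			}
		}
		return array( $bestId, $bestDiff );
	}

As long as each makeDiff() call stays in the tens of milliseconds, trying a handful of candidates per save should be affordable.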
-- Tim Starling
Tim Starling wrote:
Timwi wrote:
My guess is that the slowest part of it is checking whether a page exists, and if it does, checking its size (if the user has set the preference that shows stubs in a different colour), because both of these require a database query.
What, even with the linkscc cache and the memcached link cache? If you say so.
I apologise if my comment was in any way offensive to you, but please do take note of the fact that (a) I said it was a guess; (b) I did mention somewhere else that I have no real idea to what extent memcached is already being used; (c) I have not attacked you, or even addressed you at all.
With that said, please may I humbly ask what "the linkscc cache" actually caches? What exactly is stored in each memcache key here?
Nick Pisarro wrote:
The current parser, which performs dozens of passes, probably degrades by the square of the file size.
Really? All the regular expressions I've seen should be possible in O(N) time. There are no PHP loops which loop through every character, just through certain kinds of entities such as every link. I would have thought that 14 passes at O(N) still produces O(N). Oh well, I'm not a computer scientist, what would I know.
Our current parser is most probably O(n), but with a high constant factor. The time complexity of an algorithm is rarely useful as a measurement of its efficiency, especially on data of approximately constant size.
Isn't history compression going to be detrimental to CPU usage rather than beneficial? I am still finding it hard to understand why so many people here feel that history compression is necessary.
Timwi
On Fri, 27 Feb 2004 00:53:07 +0000, Timwi wrote:
Isn't history compression going to be detrimental to CPU usage rather than beneficial? I am still finding it hard to understand why so many people here feel that history compression is necessary.
The DB is I/O-bound currently. And the history size will grow, grow, grow.
Timwi wrote:
Tim Starling wrote:
Timwi wrote:
My guess is that the slowest part of it is checking whether a page exists, and if it does, checking its size (if the user has set the preference that shows stubs in a different colour), because both of these require a database query.
What, even with the linkscc cache and the memcached link cache? If you say so.
I apologise if my comment was in any way offensive to you, but please do take note of the fact that (a) I said it was a guess; (b) I did mention somewhere else that I have no real idea to what extent memcached is already being used; (c) I have not attacked you, or even addressed you at all.
You're misjudging my tone, I'm not offended. I'm just making fun of you :)
With that said, please may I humbly ask what "the linkscc cache" actually caches? What exactly is stored in each memcache key here?
linkscc is a database table which stores a serialised, compressed LinkCache object. This contains an array of "good" links, an array of "bad" links, and an array of image links from a given page. Hence when a page is loaded the script only needs one DB query to find out which links exist. It turns out that only a small proportion of users have a stub threshold. Viewing pages with a stub threshold does indeed require lots of DB queries.
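To make that concrete, the lookup on page view is roughly along these lines (a sketch from memory; the column names and storage details are illustrative, not necessarily the real schema):

	// Assumes a MySQL connection is already open.
	$res = mysql_query( "SELECT lcc_cacheobj FROM linkscc WHERE lcc_pageid=" . intval( $pageId ) );
	if ( $row = mysql_fetch_row( $res ) ) {
		// One row holds the whole cache: a serialised, gzip-compressed LinkCache
		$cache = unserialize( gzuncompress( $row[0] ) );
		// $cache contains arrays of good links, bad links and image links,
		// so link existence can be checked without any further queries.
	}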
In addition, the memcached "lc" keys store an article ID for each page title. This improves efficiency where linkscc is not used, for example when saving a page.
Isn't history compression going to be detrimental to CPU usage rather than beneficial? I am still finding it hard to understand why so many people here feel that history compression is necessary.
It's detrimental to CPU usage, but it reduces I/O on the database machine, allows RAM caching of more articles, and reduces the bandwidth required for backups and slave DB synchronisation. If compression and decompression can be done in a few milliseconds of CPU time, it's likely that it will be beneficial overall.
Although CPU is dominant for page views, for many other features it is the database which is dominant. When viewing history, the database is the major bottleneck.
-- Tim Starling
Wikimedia developers &lt;wikitech-l@Wikipedia.org&gt; writes:
Nick Pisarro wrote:
The current parser, which performs dozens of passes, probably degrades by the square of the file size.
Really? All the regular expressions I've seen should be possible in O(N) time. There are no PHP loops which loop through every character, just through certain kinds of entities such as every link. I would have thought that 14 passes at O(N) still produces O(N). Oh well, I'm not a computer scientist, what would I know.
I had to think about this for a few days:
Doing the scan is linear; however, putting together the output string is not.
The time to do a single string concatenation can be expected to go up linearly with the total length of the strings being concatenated. The number of concatenations can be expected to go up linearly with the length of the string, assuming a roughly constant rate of markups per byte. So the time to put together the wikified output is going to go up as the product of the two.
So, the time to put together an article is going to be: TotalTime = ScanTimePerByte * length + AssemblyTimePerByte * length ^ 2.
So, a 30,000 byte article is going to take 10,000 times as long to assemble as a 300 byte article. Depending on the values of 'ScanTimePerByte' and 'AssemblyTimePerByte' in the above formula, at some point the assembly time is going to swamp out the scan time.
So the question then becomes: how does one put together the output string in a way that goes up linearly with the length and not with the square of the length? Doing "Append" operations rather than concatenates should go up at a rate that is fairly linear. But there is no string append. You can do an append to an array, however.
An idea here is that instead of gluing the output string together as one goes, one appends each piece of output as an individual element of an array. When one is finished, one uses the PHP "implode" function, with no 'glue' argument, to do the final assembly.
If the "implode" operation is well written (it may not be), it should first add up the total number of bytes it will need for the result string, then allocate the required space in one operation and assemble the final string by moving everything in at once.
The current development project I am doing in my "day" job is being done in Visual Basic .Net--upgraded from VB 6. (No comments please.) Our product was choking when assembling certain .html files. The problem turned out to be assembling an output buffer by doing up to 25,000 byte-by-byte concatenates. By using a new object in VB .Net called a "StringBuilder", which allows doing appends to a buffer, the processing time for this operation dropped to insignificance.
An example of my suggested idea is optimizing such an operation in the new Parser object. The parser object turns the string to be wikified into tokens in one fell swoop, an excellent optimization, but then builds the result by doing "$s .= $txt;" operations repeatedly. Instead, they should be replaced by "$sIntermediate[] = $txt".
When finished, 'replaceInternalLinks' can do a 'return implode("", $sIntermediate);'.
There are other places in the parser where similar operations exist. Certainly, since Internet protocols can't be nested, the function 'replaceExternalLinks' could be rewritten to assemble an output array and then implode it, rather than doing many square-law Replace operations using the $unique strings.
Another place where $unique is used, a horrible kludge (sorry to the author), is the handling of <nowiki>, <pre>, and <math> tags. Since tags processed by 'doWikiPass2' cannot cross <nowiki>... tags [is this allowed, and if so, is it desirable?], the processing of those tags could blow the text into individual strings in an array, run 'doWikiPass2' on just the appropriate pieces, and assemble the final product with an implode.
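A rough sketch of that splitting idea (the exact call into 'doWikiPass2' is invented here; the real method may want different arguments):

	// Split out <nowiki>...</nowiki> sections, wikify the rest, reassemble.
	$pieces = preg_split( '/(<nowiki>.*?<\/nowiki>)/s', $text, -1, PREG_SPLIT_DELIM_CAPTURE );
	foreach ( $pieces as $i => $piece ) {
		if ( $i % 2 == 0 ) {
			// Even indices are ordinary text: run the normal wiki pass on them
			$pieces[$i] = doWikiPass2( $piece );   // invented call signature
		}
		// Odd indices are the <nowiki> blocks themselves: leave them untouched
	}
	$text = implode( '', $pieces );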
There may be similar optimizations in table processing, which I have not looked at.
I have to go do my "day" job now, but I'll try out some of these things the next time I get a chance, unless someone gets to this before I do. Please let me know if one of you starts trying these ideas out before I can get to it.
Nick Pisarro
Nick Pisarro writes:
by doing "$s .= $txt;" operations
Whoops!
"$s .= $txt" is a string append operation!!. I used a poor example.
So a question becomes is how PHP treats this operation and/or 'implode'.
It may treat "$s .= $txt' internally as "$s = $s . $txt" or it may be smarter than that. Likewise 'implode' could do the whole operation at once, or, it could it break down into a series of appends or concatenates.
Many garbage-collected string-handling languages tend never to modify existing strings, but always to create new ones. Short of finding an ultra PHP maven who knows the answer to these questions, some timing tests of the functions might be required to figure this out.
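A quick and dirty timing test along these lines (just a sketch of what I'd try, nothing definitive) would probably settle it:

	// Compare repeated "$s .= $txt" appends against collect-then-implode
	// on 25,000 fragments. Note: microtime( true ) assumes PHP 5.
	$fragments = array_fill( 0, 25000, 'some markup fragment ' );

	$start = microtime( true );
	$s = '';
	foreach ( $fragments as $f ) {
		$s .= $f;                        // repeated string append
	}
	$appendTime = microtime( true ) - $start;

	$start = microtime( true );
	$buf = array();
	foreach ( $fragments as $f ) {
		$buf[] = $f;                     // collect the pieces...
	}
	$s2 = implode( '', $buf );           // ...then join them once
	$implodeTime = microtime( true ) - $start;

	echo "append: $appendTime sec, implode: $implodeTime sec\n";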
While the "$s .= $txt" may not be a good example, there are other places where replaces or true concatenates are done in MediaWiki that could be optimized.
Nick Pisarro
Nick Pisarro wrote:
Wikimedia developers &lt;wikitech-l@Wikipedia.org&gt; writes:
Really? All the regular expressions I've seen should be possible in O(N) time.
I had to think about this for a few days: Doing the scan is linear; however, putting together the output string is not.
You're probably right, and our current parser does turn out to be O(n^2).
I have to go do my "day" job now, but I'll try out some of these things the next time I get a chance, unless someone gets to this before I do. Please let me know if one of you starts trying these ideas out before I can get to it.
It would probably be wiser to simply write a new parser that does it the "correct" way (lexing, parse tree generation, translation).
Timwi