According to the hardware orders page[1], a major bottleneck is page rendering.
Does any phase of this stand out? DB fetch? Text parse? Request/Response bandwidth?
I assume profiling on this has been done quite a lot...
[1] http://meta.wikimedia.org/wiki/Hardware_ordered_August_30%2C_2005
Jeremy Dunck wrote:
According to the hardware orders page[1], a major bottleneck is page rendering.
Does any phase of this stand out? DB fetch? Text parse? Request/Response bandwidth?
Depends on the page. On most pages the biggest chunks of rendering time are in handling links. (On most pages most of the markup is links, so...)
On pages with lots of eg tables or whatever that can end up eating some time as well of course.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Jeremy Dunck wrote:
According to the hardware orders page[1], a major bottleneck is page rendering.
Does any phase of this stand out? DB fetch? Text parse? Request/Response bandwidth?
Depends on the page. On most pages the biggest chunks of rendering time are in handling links. (On most pages most of the markup is links, so...)
Do we still query the database for all these links to decide whether they exist and (depending on user settings) whether they are stubs? As I recall, someone once had an idea to write a simple, light-weight daemon for this that uses a dictionary data structure (I presume they were thinking of a DAWG, a directed acyclic word graph).
There are many algorithms available to generate DAWGs; I wrote one in Java once, but it's slow; you can probably find a faster one in a faster language. Or has this idea been scrapped for some reason?
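For anyone who hasn't come across the structure, here is a minimal sketch of the kind of yes/no lookup such a daemon would answer (illustration only, not MediaWiki code; it builds a plain prefix trie, of which a real DAWG is the minimised, suffix-sharing form):

<?php
// Illustration only: a prefix trie answering "does this title exist?".
// A real DAWG is the minimised form of this automaton; by sharing common
// suffixes as well as prefixes it stays much smaller than the title list.

function trie_insert( &$node, $title ) {
    foreach ( str_split( $title ) as $ch ) {
        if ( !isset( $node[$ch] ) ) {
            $node[$ch] = array();
        }
        $node = &$node[$ch];
    }
    $node['#'] = true; // end-of-title marker
}

function trie_contains( $node, $title ) {
    foreach ( str_split( $title ) as $ch ) {
        if ( !isset( $node[$ch] ) ) {
            return false;
        }
        $node = $node[$ch];
    }
    return isset( $node['#'] );
}

$titles = array();
trie_insert( $titles, 'Foo' );
trie_insert( $titles, 'Foobar' );
var_dump( trie_contains( $titles, 'Foobar' ) ); // bool(true)
var_dump( trie_contains( $titles, 'Foob' ) );   // bool(false)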
Timwi
On 9/10/05, Timwi timwi@gmx.net wrote:
Do we still query the database for all these links to decide whether they exist and (depending on user settings) whether they are stubs? As I recall, someone once had an idea to write a simple, light-weight daemon for this that uses a dictionary data structure (I presume they were thinking of a DAWG (directed acyclic word graph)) for this.
I'm not familiar with DAWG, but why wouldn't a nested sparse hash work?
link[from-id][to-title] = 0 if no link, 1 if link.
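Purely as an illustration of what I mean (a plain in-memory PHP array standing in for whatever sparse-hash library would actually back it):

<?php
// Illustration only: the nested-hash idea as a plain associative array.
$link = array();
$link[42]['Main_Page'] = 1;                            // page #42 links to [[Main Page]]
$a = isset( $link[42]['Main_Page'] ) ? 1 : 0;          // 1: link exists
$b = isset( $link[42]['Some_Other_Page'] ) ? 1 : 0;    // 0: no link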
Jeremy Dunck wrote:
On 9/10/05, Timwi timwi@gmx.net wrote:
Do we still query the database for all these links to decide whether they exist and (depending on user settings) whether they are stubs? As I recall, someone once had an idea to write a simple, light-weight daemon for this that uses a dictionary data structure (I presume they were thinking of a DAWG (directed acyclic word graph)) for this.
I'm not familiar with DAWG, but why wouldn't a nested sparse hash work? link[from-id][to-title] = 0 if no link, 1 if link.
Well, firstly, we're not interested in links between pages here, but merely in whether a page exists and if so, whether a page is a "stub".
The answer to your question is that a hash table uses much more memory. It says on the page you linked to that it has "2 bits per entry overhead", so the entire hash table would be the size of the list of all article titles plus 2 bits times the number of articles. A DAWG is usually much smaller than the size of the list of all entries.
The downside, of course, is that in a hash you can store any value with all the keys, while a DAWG is just a binary (yes/no) decider.
Timwi
Timwi wrote:
Brion Vibber wrote:
Jeremy Dunck wrote:
According to the hardware orders page[1], a major bottleneck is page rendering.
Does any phase of this stand out? DB fetch? Text parse? Request/Response bandwidth?
Depends on the page. On most pages the biggest chunks of rendering time are in handling links. (On most pages most of the markup is links, so...)
Do we still query the database for all these links to decide whether they exist and (depending on user settings) whether they are stubs? As I recall, someone once had an idea to write a simple, light-weight daemon for this that uses a dictionary data structure (I presume they were thinking of a DAWG (directed acyclic word graph)) for this.
There are many algorithms available to generate DAWGs; I wrote one in Java once, but it's slow, you can probably find a faster one in a faster language. Or has this idea been scrapped for some reason?
I considered implementing a cache for link existence, but once we realised it could be done at quite satisfactory speed using the cur (now page) table only, the robustness gains convinced me that it was the best solution. External daemons necessarily bring a higher system administration overhead. Caches, especially ones which don't respect database transactions, tend to develop inconsistencies.
I think Brion meant replaceInternalLinks (7.9%) and the CPU component of replaceLinkHolders (3.2%), not the 0.75% of execution time now required for link lookup within the parser.
It's true that the addLinkObj query still takes 5.5%. In my opinion, that can be reduced to near zero by implementing batch lookup techniques ubiquitously.
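To make "batch lookup" concrete, here is a rough sketch (not the actual parser code; $db is assumed to be a PDO-style handle, and the page_namespace/page_title/page_len columns are used purely for illustration):

<?php
// Sketch of a batched existence lookup: collect every title linked from the
// page being parsed and resolve them all in one query, instead of issuing
// one query per link.  Assumes $db is a PDO handle; column names are
// illustrative rather than a description of the real code paths.
function lookupLinkTargets( PDO $db, $namespace, array $titles ) {
    if ( !$titles ) {
        return array();
    }
    $quoted = array();
    foreach ( $titles as $t ) {
        $quoted[] = $db->quote( $t );
    }
    $sql = 'SELECT page_title, page_len FROM page' .
        ' WHERE page_namespace = ' . intval( $namespace ) .
        ' AND page_title IN (' . implode( ',', $quoted ) . ')';
    $found = array();
    foreach ( $db->query( $sql ) as $row ) {
        // page_len lets the caller apply its own stub threshold too
        $found[ $row['page_title'] ] = intval( $row['page_len'] );
    }
    return $found; // titles absent from the result simply don't exist
}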
-- Tim Starling
Jeremy Dunck wrote:
According to the hardware orders page[1], a major bottleneck is page rendering.
Does any phase of this stand out? DB fetch? Text parse? Request/Response bandwidth?
I assume profiling on this has been done quite a lot...
[1] http://meta.wikimedia.org/wiki/Hardware_ordered_August_30%2C_2005
See http://meta.wikimedia.org/wiki/Profiling/20050822. I don't think I'd call any of it a bottleneck; optimisation aimed at reducing average page view time or CPU load is hard work. I suspect there are hotspots, but in the PHP VM rather than in our code. I think the best way to reduce page view time at this stage would be to optimise the VM, or to produce a JIT compiler.
It's much easier to produce user-visible speed improvements by concentrating on relatively rare but slow operations. These often account for a similar amount of the average request time as do the parser "hotspots". And they've often been neglected by past optimisation work.
For example, I notice on the profiling results I posted above, the most expensive database query is from oaiUpdatePage(). That's not a page view query at all, it's a save query. It's a relatively new feature, and it probably hasn't been examined in the optimisation context. It's apparently adding 1750ms to save times, which is a huge amount, easily user-visible.
Another similar possibility is to work on optimisation of pathological page views, rather than typical page views. It's easy to construct a page that takes more than 30s to render. I recently did some work on pathological diffs and pathological image scaling, and in both cases achieved a speedup on the order of 100x for the test case.
Optimisations such as these do reduce our costs, but they also improve user experience -- disproportionately so, compared to a 5% reduction in time for an already subjectively fast operation. They also improve robustness -- something which adding more hardware cannot do.
-- Tim Starling
On 9/7/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Jeremy Dunck wrote:
According to the hardware orders page[1], a major bottleneck is page rendering.
See http://meta.wikimedia.org/wiki/Profiling/20050822. I don't think I'd call any of it a bottleneck, optimisation aimed at reducing average page view time or CPU load is hard work. I suspect there are hotspots, but not in our code, in the PHP VM. I think the best way to reduce page view time at this stage would be to optimise that, or to produce a JIT compiler.
I've often wondered this, so this is a great opportunity to jump in. Why not cache prerendered versions of all pages? It would seem that the majority of hits are reads. One approach I've seen elsewhere is to cache a page the first time it's loaded, and then have writes invalidate the cache. (That way you're not caching pages nobody looks at.)
It seems the total content of wikipedia isn't that big.
One tricky part is that writes to page B can affect page A if page A has a link to B. A reverse index of links would solve this, though I don't know how big it'd be.
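Roughly what I have in mind, with hypothetical names throughout ($cache, the "rendered:" key scheme and the l_from/l_namespace/l_title columns are assumptions for the sake of the sketch, not the real schema):

<?php
// Hypothetical write-through invalidation using a reverse link index: when a
// page is saved, drop the cached rendering of that page and of every page
// recorded as linking to it.
function invalidateOnSave( PDO $db, $cache, $namespace, $title, $pageId ) {
    $cache->delete( "rendered:$pageId" );
    $stmt = $db->prepare(
        'SELECT l_from FROM links WHERE l_namespace = ? AND l_title = ?' );
    $stmt->execute( array( $namespace, $title ) );
    foreach ( $stmt as $row ) {
        $cache->delete( 'rendered:' . $row['l_from'] );
    }
}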
(PS: Sorry to be a back seat driver -- I know how frustrating it is when someone else tries to tell you they know your problem space better than you do. So if this is a stupid suggestion, feel free to tell me so; I'm mostly interested in the why/why not answer.)
Brion Vibber wrote:
Evan Martin wrote:
I've often wondered this, so this is a great opportunity to jump in. Why not cache prerendered versions of all pages?
We do. :)
... sort of. i still want to use the file cache for this, i'm sure we could massively reduce page render counts even over the normal parser cache, not to mention the startup time for the wiki (Setup.php).
kate.
On Thu, Sep 08, 2005 at 11:31:54PM +0100, Kate wrote:
Brion Vibber wrote:
Evan Martin wrote:
I've often wondered this, so this is a great opportunity to jump in. Why not cache prerendered versions of all pages?
We do. :)
... sort of. i still want to use the file cache for this, i'm sure we could massively reduce page render counts even over the normal parser cache, not to mention the startup time for the wiki (Setup.php).
Isn't this what we use Squid for? Setup.php et al. are only invoked for logged-in users (and cache misses); they're needed to load user preferences, "You have new messages" notifications, etc.
Regards,
jens
Evan Martin wrote:
Why not cache prerendered versions of all pages?
As Brion already mentioned, we're already doing this, but even more:
One tricky part is that writes to page B can affect page A if page A has a link to B. A reverse index of links would solve this, though I don't know how bit it'd be.
We have a table, 'links' I think, which stores all links between pages. The 'from' of each link is the numerical page ID (so we don't need to update it when the page is moved), and the 'to' is the namespace and title (so we don't need to update it either when a page is moved, because we really do want the link to continue to point to the old title). The only overhead incurred is on writes: when saving an edit, we need to update this table with the links that were added to or removed from the article; when an article gets deleted, links to the deleted page may need to be moved to 'brokenlinks' (another table; its purpose should be obvious now), and links from that deleted page need to be deleted.
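In rough pseudo-PHP, the bookkeeping on save looks something like this (the helper names are invented for illustration; the real code is organised differently):

<?php
// Keep the 'links' table in sync on save: compare the link sets before and
// after the edit and only touch the rows that actually changed.  The helper
// functions here are invented placeholders.
$oldLinks = getStoredLinkTitles( $db, $pageId ); // what 'links' currently records
$newLinks = extractLinkTitles( $newWikitext );   // what the saved text now contains
foreach ( array_diff( $newLinks, $oldLinks ) as $title ) {
    insertLinkRow( $db, $pageId, $title );       // link added by this edit
}
foreach ( array_diff( $oldLinks, $newLinks ) as $title ) {
    deleteLinkRow( $db, $pageId, $title );       // link removed by this edit
}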
So yeah, we already have a reverse index of links, and we already use it in a lot of places (the most user-visible one being [[Special:Whatlinkshere]]).
There are separate tables for 'categorylinks', 'imagelinks' and others that I can't think of now.
Timwi
Tim Starling wrote:
See http://meta.wikimedia.org/wiki/Profiling/20050822 [...] It's much easier to produce user-visible speed improvements by concentrating on relatively rare but slow operations. These often
Good thinking, but your kind of profiling output doesn't support this kind of optimization. At susning.nu I have a log file that reports slow spots and nothing but that. If any page view takes more than 10 seconds, an apology is added to the page footer. It's very easy:
sub TimeLog {
    my ($text) = @_;
    my $elapsed = time - $EntryTime;
    if ($elapsed > 10) {
        $SlownessExcuse = "<br><b>We are sorry</b> "
            . "that this page took so long to display.\n"
            . "If you have any questions about this, ...\n";
    }
    if ($trace || $elapsed > 4) {
        my $line = LogTime(time) . " $elapsed $EntryTime $text\n";
        open(OUT, ">>$LogDir/time_log");
        print OUT $line;
        close(OUT);
        $EntryTime = time unless $trace;  # reset to avoid repeated alarms
    }
}
Then just sprinkle your code with calls to TimeLog("diff..."). As long as less than 4 seconds have passed since this request was received, these calls cost so little that you can think of them as comments to the source code. If your "time_log" file grows fast, optimize the most commonly called spots first.
Lars Aronsson wrote:
Tim Starling wrote:
See http://meta.wikimedia.org/wiki/Profiling/20050822 [...] It's much easier to produce user-visible speed improvements by concentrating on relatively rare but slow operations. These often
Good thinking, but your kind of profiling output doesn't support this kind of optimization.
Strange, how did I manage to perform it successfully so many times then? Must have been magic. Feel free to gasp in awe ;)
-- Tim Starling