Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time?
For some past stuff, see: http://www.google.com/search?q=site%3Ameta.wikimedia.org+profiling
More recently, Domas and Tim have been playing with kcachegrind and related tools to get even more fine-grained recordings and pretty visualizations.
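If you want a quick-and-dirty look yourself, something along these lines will rank the heaviest functions from a profiling log. (This is just a sketch in Python; it assumes a hypothetical "function<TAB>milliseconds" line format, which is not what the real profiler output looks like.)

    #!/usr/bin/env python
    # Rough sketch: aggregate per-function wall-clock time from a profiling log.
    # Assumes a hypothetical tab-separated "function<TAB>milliseconds" format;
    # the real profiler output will differ.
    import sys
    from collections import defaultdict

    totals = defaultdict(float)
    calls = defaultdict(int)

    for line in sys.stdin:
        try:
            func, ms = line.rstrip("\n").split("\t")
            totals[func] += float(ms)
            calls[func] += 1
        except ValueError:
            continue  # skip malformed lines

    # Print the heaviest functions first
    for func, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:20]:
        print("%10.1f ms  %6d calls  %s" % (total, calls[func], func))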
There is no immediate need, but I wanted to shoot off an idea to consider. If we were able to identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would it help Wikipedia or MediaWiki in general?
Well, that's basically what the apache cluster is... :)
Let's say we determined that the code that creates a diff between two pages is a hog and can be put into the pool. We could use something like BOINC, http://boinc.berkeley.edu/, to standardize the pool. We could add the diff process to the pool as the server load gets heavy.
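To sketch the idea concretely (in Python, with made-up names like compute_diff() and load_average(), and nothing BOINC-specific), something like this is what I have in mind:

    # Very rough sketch of "push heavy work to a pool when the server is busy".
    import os
    import queue
    import threading

    work_queue = queue.Queue()
    LOAD_THRESHOLD = 4.0  # arbitrary cutoff for "server is busy"

    def load_average():
        return os.getloadavg()[0]  # 1-minute load average (Unix only)

    def compute_diff(old_text, new_text):
        # Placeholder for the real diff engine.
        import difflib
        return "\n".join(difflib.unified_diff(old_text.splitlines(),
                                              new_text.splitlines()))

    def worker():
        while True:
            old_text, new_text, done = work_queue.get()
            done(compute_diff(old_text, new_text))
            work_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()

    def request_diff(old_text, new_text, done):
        if load_average() > LOAD_THRESHOLD:
            work_queue.put((old_text, new_text, done))  # defer to the pool
        else:
            done(compute_diff(old_text, new_text))      # do it inline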
Well, diffs, for instance, are very rare compared to other tasks; they've never been an overall drain on resources, but "pathological" diffs on very large pages can be slow, which is frustrating for interactive performance. When the diff you're looking at takes forever to return (or times out and doesn't return at all), it's rather a pain.
(Note that diff has already been hugely sped up between caching and rewriting it as a C++ plugin.)
Offloading a potentially heavy task like diffing or thumbnail generation to yet another set of servers could have plusses or minuses; a big minus is that such queuing could negatively impact interactive performance. You don't just want that diff some day; you want it right when you click on it.
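For what it's worth, the current handling is closer to a cache-then-compute pattern: serve a cached diff if we have one, otherwise compute it synchronously so the user gets it on this request rather than waiting behind a queue of offloaded jobs. Very roughly (the cache here and make_cache_key() are stand-ins, not MediaWiki's actual interfaces):

    # Sketch of cache-then-compute; the cache and helper names are illustrative.
    diff_cache = {}  # in production this would be memcached or similar

    def make_cache_key(old_rev_id, new_rev_id):
        return "diff:%d:%d" % (old_rev_id, new_rev_id)

    def get_diff(old_rev_id, new_rev_id, fetch_text, compute_diff):
        key = make_cache_key(old_rev_id, new_rev_id)
        if key in diff_cache:
            return diff_cache[key]  # cheap: repeat views hit here
        # Cache miss: compute synchronously so the diff comes back on this
        # request instead of sitting in a job queue.
        result = compute_diff(fetch_text(old_rev_id), fetch_text(new_rev_id))
        diff_cache[key] = result
        return result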
-- brion vibber (brion @ pobox.com)