Jonathan wrote:
Is there a way to analyze the wikipedia logs to figure out which processes
take the most time?
For some past stuff, see:
http://www.google.com/search?q=site%3Ameta.wikimedia.org+profiling
Domas and Tim have more recently been playing with kcachegrind and similar tools
to get even more fine-grained recordings and pretty visualizations.
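
As a rough illustration of the kind of log crunching involved, here's a minimal
Python sketch that totals elapsed time per operation. The input format (an
operation name and a millisecond figure per line) is made up for the example;
MediaWiki's real profiling output is structured differently, so the parsing
would need to change.

#!/usr/bin/env python
# Rough sketch: total up elapsed time per operation from a profiling log.
# Assumes a made-up line format "<operation> <elapsed_ms>".
import sys
from collections import defaultdict

totals = defaultdict(float)
counts = defaultdict(int)

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 2:
        continue                  # skip anything not matching the assumed format
    try:
        ms = float(parts[1])
    except ValueError:
        continue
    totals[parts[0]] += ms
    counts[parts[0]] += 1

# Heaviest operations first.
for op in sorted(totals, key=totals.get, reverse=True):
    print("%10.1f ms total  %6d calls  %s" % (totals[op], counts[op], op))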
There is no immediate need, but I wanted to shoot off an idea to consider.
If we were able to identify the processes that put a heavy load on the servers
and push them onto a distributed process pool, would it help wikipedia or
mediawiki in general?
Well, that's basically what the apache cluster is... :)
Let's say we determined that the code that creates a diff for two pages is a
hog and can be put into the pool. We could use something like BOINC,
http://boinc.berkeley.edu/, to standardize the pool. We could add the diff
process to the pool as the server load gets heavy.
Well, diffs, for instance, are very rare compared to other tasks; they've never
been an overall drain on resources, but "pathological" diffs on very large pages
can be slow, which is frustrating for interactive performance. When the diff
you're looking at takes forever to return (or times out and doesn't return at
all) it's rather a pain.
(Note that diff has already been hugely sped up between caching and rewriting it
as a C++ plugin.)
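
To make the caching half of that concrete, here's a minimal sketch of the idea:
key the rendered diff on the pair of revision IDs so a slow diff only has to be
computed once. The names (compute_diff, get_diff, fetch_text) are invented for
the example, and difflib stands in for the much faster C++ diff engine.

import difflib

cache = {}   # in a real setup this would be memcached or similar

def compute_diff(old_text, new_text):
    # stand-in for the real diff engine
    return "\n".join(difflib.unified_diff(old_text.splitlines(),
                                          new_text.splitlines(),
                                          lineterm=""))

def get_diff(old_rev_id, new_rev_id, fetch_text):
    key = (old_rev_id, new_rev_id)
    if key not in cache:
        cache[key] = compute_diff(fetch_text(old_rev_id),
                                  fetch_text(new_rev_id))
    return cache[key]

# The second call returns the cached result without recomputing.
texts = {1: "one\ntwo\nthree", 2: "one\n2\nthree"}
print(get_diff(1, 2, texts.get))
print(get_diff(1, 2, texts.get))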
Offloading a potentially heavy task like diffing or thumbnail generation to yet
another set of servers could have plusses or minuses; a big minus is that such
queuing could negatively impact interactive performance. You don't just want
that diff some day; you want it right when you click on it.
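
Here's a toy sketch of that latency concern, using a local worker pool as a
stand-in for a separate set of servers; everything in it is hypothetical and
is not how MediaWiki dispatches work. The point is that the web request still
blocks until the job comes back, so any time the job spends queued behind other
work is added directly to what the user sees.

import time
from multiprocessing import Pool

def heavy_diff(job):
    old_rev, new_rev = job
    time.sleep(0.1)                      # pretend this is an expensive diff
    return "diff(%s, %s)" % (old_rev, new_rev)

if __name__ == "__main__":
    pool = Pool(processes=4)
    start = time.time()
    # The "web request" blocks here; queueing delay shows up as user-visible latency.
    result = pool.apply_async(heavy_diff, (("r1", "r2"),)).get()
    print("%s took %.3f s round trip" % (result, time.time() - start))
    pool.close()
    pool.join()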
-- brion vibber (brion @ pobox.com)