Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time?
For some past stuff, see: http://www.google.com/search?q=site%3Ameta.wikimedia.org+profiling
More recently, Domas and Tim have been playing with kcachegrind and related tools to get even more fine-grained recordings and pretty visualizations.
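If you want a quick-and-dirty look yourself, something along these lines will rank the heaviest functions from a profiling log. (This is just a sketch in Python; it assumes a hypothetical "function<TAB>milliseconds" line format, which is not what the real profiler output looks like.)

    #!/usr/bin/env python
    # Rough sketch: aggregate per-function wall-clock time from a profiling log.
    # Assumes a hypothetical tab-separated "function<TAB>milliseconds" format;
    # the real profiler output will differ.
    import sys
    from collections import defaultdict

    totals = defaultdict(float)
    calls = defaultdict(int)

    for line in sys.stdin:
        try:
            func, ms = line.rstrip("\n").split("\t")
            totals[func] += float(ms)
            calls[func] += 1
        except ValueError:
            continue  # skip malformed lines

    # Print the heaviest functions first
    for func, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:20]:
        print("%10.1f ms  %6d calls  %s" % (total, calls[func], func))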
There is no immediate need, but I wanted to shoot off an idea to consider. If we were able to identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would it help Wikipedia or MediaWiki in general?
Well, that's basically what the apache cluster is... :)
Let's say we determined that the code that creates a diff between two pages is a hog and can be put into the pool. We could use something like BOINC, http://boinc.berkeley.edu/, to standardize the pool. We could add the diff process to the pool as the server load gets heavy.
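To sketch the idea concretely (in Python, with made-up names like compute_diff() and load_average(), and nothing BOINC-specific), something like this is what I have in mind:

    # Very rough sketch of "push heavy work to a pool when the server is busy".
    import os
    import queue
    import threading

    work_queue = queue.Queue()
    LOAD_THRESHOLD = 4.0  # arbitrary cutoff for "server is busy"

    def load_average():
        return os.getloadavg()[0]  # 1-minute load average (Unix only)

    def compute_diff(old_text, new_text):
        # Placeholder for the real diff engine.
        import difflib
        return "\n".join(difflib.unified_diff(old_text.splitlines(),
                                              new_text.splitlines()))

    def worker():
        while True:
            old_text, new_text, done = work_queue.get()
            done(compute_diff(old_text, new_text))
            work_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()

    def request_diff(old_text, new_text, done):
        if load_average() > LOAD_THRESHOLD:
            work_queue.put((old_text, new_text, done))  # defer to the pool
        else:
            done(compute_diff(old_text, new_text))      # do it inline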
Well, diffs, for instance, are very rare compared to other tasks; they've never been an overall drain on resources, but "pathological" diffs on very large pages can be slow, which is frustrating for interactive performance. When the diff you're looking at takes forever to return (or times out and doesn't return at all), it's rather a pain.
(Note that diff has already been hugely sped up between caching and rewriting it as a C++ plugin.)
Offloading a potentially heavy task like diffing or thumbnail generation to yet another set of servers could have plusses or minuses; a big minus is that such queuing could negatively impact interactive performance. You don't just want that diff some day; you want it right when you click on it.
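For what it's worth, the current handling is closer to a cache-then-compute pattern: serve a cached diff if we have one, otherwise compute it synchronously so the user gets it on this request rather than waiting behind a queue of offloaded jobs. Very roughly (the cache here and make_cache_key() are stand-ins, not MediaWiki's actual interfaces):

    # Sketch of cache-then-compute; the cache and helper names are illustrative.
    diff_cache = {}  # in production this would be memcached or similar

    def make_cache_key(old_rev_id, new_rev_id):
        return "diff:%d:%d" % (old_rev_id, new_rev_id)

    def get_diff(old_rev_id, new_rev_id, fetch_text, compute_diff):
        key = make_cache_key(old_rev_id, new_rev_id)
        if key in diff_cache:
            return diff_cache[key]  # cheap: repeat views hit here
        # Cache miss: compute synchronously so the diff comes back on this
        # request instead of sitting in a job queue.
        result = compute_diff(fetch_text(old_rev_id), fetch_text(new_rev_id))
        diff_cache[key] = result
        return result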
-- brion vibber (brion @ pobox.com)