Robert Rohde wrote:
> Many of the things done for the statistical analysis of database dumps
> should be suitable for parallelization (e.g. break the dump into
> chunks, process the chunks in parallel and sum the results). You
> could talk to Erik Zachte. I don't know if his code has already been
> designed for parallel processing though.
I don't think it's a good candidate, since you are presumably reading
compressed files, and the decompression step serialises the processing
(and is most likely the bottleneck, too).
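For illustration, the chunk/process/sum pattern Robert describes could be sketched like this in Python (a toy sketch only: `count_revisions` and the pre-split, already-decompressed chunks are hypothetical stand-ins for a real dump statistic):

```python
from multiprocessing import Pool

def count_revisions(chunk):
    # Toy per-chunk statistic: count lines that open a revision
    # in a pre-split, uncompressed chunk of the dump.
    return sum(1 for line in chunk if "<revision>" in line)

def parallel_stats(chunks, workers=4):
    # Process the chunks in parallel, then sum the partial results.
    with Pool(workers) as pool:
        return sum(pool.map(count_revisions, chunks))

if __name__ == "__main__":
    chunks = [
        ["<revision>", "some text", "<revision>"],
        ["<revision>"],
    ]
    print(parallel_stats(chunks))  # 3
```

Note that this assumes the dump has already been split into chunks; with a single compressed file, the sequential decompression mentioned above still dominates unless the split itself can be done on compressed-stream boundaries.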
> Another option might be to look at the methods for compressing old
> revisions (is [1] still current?).
> I make heavy use of parallel processing in my professional work (not
> related to wikis), but I can't really think of any projects I have at
> hand that would be accessible and completable in a month.
>
> -Robert Rohde
>
> [1] http://www.mediawiki.org/wiki/Manual:CompressOld.php
It can be used; I am unsure whether it is actually used by the WMF.
Another thing that would be nice to parallelise is the parser tests.
That would need adding cotasks (cooperative tasks) to PHP, or something
similar. The most similar extension I know of is runkit, which works the
other way around: several PHP scopes instead of several threads in one
scope.
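PHP itself lacks the in-process concurrency this would need, but since parser test cases are independent of each other, the fan-out is straightforward with worker processes. As a stand-in sketch in Python (the test cases and the trivial "parser" inside `run_test` are hypothetical):

```python
from concurrent.futures import ProcessPoolExecutor

def run_test(case):
    # Hypothetical stand-in for running one parser test case:
    # takes (name, input, expected) and returns (name, passed).
    name, text, expected = case
    return name, text.upper() == expected  # toy "parser"

def run_suite(cases, workers=4):
    # Fan the independent test cases out over worker processes
    # and collect the pass/fail results by name.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_test, cases))

if __name__ == "__main__":
    cases = [("upper", "ab", "AB"), ("broken", "ab", "ab")]
    print(run_suite(cases))  # {'upper': True, 'broken': False}
```

The same process-per-worker approach would work for PHP by forking the test runner, at the cost of sharing no state between workers, which is exactly what cotasks or runkit-style sandboxes would otherwise provide.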