On Tue, Oct 26, 2010 at 8:25 AM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
On 26-10-2010 (Tue), at 16:25 +0200, Platonides wrote:
Robert Rohde wrote:
Many of the things done for the statistical
analysis of database dumps
should be suitable for parallelization (e.g. break the dump into
chunks, process the chunks in parallel and sum the results). You
could talk to Erik Zachte. I don't know if his code has already been
designed for parallel processing though.
I don't think it's a good candidate, since you are presumably working
from compressed files, and decompression serialises the processing (and
is most likely the bottleneck, too).
If one were clever (and I have some code that would enable one to be
clever), one could seek to some point in the (bzip2-compressed) file and
uncompress from there before processing. Running a bunch of jobs each
decompressing only their small piece then becomes feasible. I don't
have code that does this for gz or 7z; afaik these do not do compression
in discrete blocks.
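Ariel's bzip2 trick works because every bzip2 block begins with the
48-bit magic number 0x314159265359, so one can scan the compressed bit
stream for block boundaries without decompressing anything first (the
same idea bzip2recover uses). A minimal sketch, assuming Python; the
helper name and the toy input are mine, not Ariel's actual code:

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # 48-bit magic that opens every bzip2 block
MAGIC_BITS = 48

def find_block_starts(data):
    """Return bit offsets in `data` where a bzip2 block begins.

    Blocks are not byte-aligned, so slide a 48-bit window across the
    bit stream and compare it against the block magic.
    """
    hits = []
    window, nbits = 0, 0
    mask = (1 << MAGIC_BITS) - 1
    for byte in data:
        for i in range(7, -1, -1):
            window = ((window << 1) | ((byte >> i) & 1)) & mask
            nbits += 1
            if nbits >= MAGIC_BITS and window == BLOCK_MAGIC:
                hits.append(nbits - MAGIC_BITS)  # bit offset of block start
    return hits

# 256 KB of non-run-length data; at compresslevel=1 (100 KB blocks)
# this forces several blocks into a single stream.
raw = bytes(range(256)) * 1000
compressed = bz2.compress(raw, compresslevel=1)
print(find_block_starts(compressed))
```

The first offset is 32, i.e. the first block starts right after the
4-byte "BZh1" stream header; each worker could then bit-shift its slice
and hand it to a decompressor.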
Actually the LZMA used by default in 7z can be partially parallelized,
with some strong limitations:
1) The start of block N can generally only be located by finding the
end of block N-1, so files have to be read serially.
2) The ability to decompress block N may or may not depend on already
having decompressed blocks N-1, N-2, N-3, etc., depending on the
details of the data stream.
Point 2 in particular tends to lead to a lot of conflicts that
prevent parallelization. If block N happens to be independent of
block N-1 then they can be done in parallel, but in general this will
not be the case. The frequency of such conflicts depends a lot on the
data stream and options given to the compressor.
Last year LZMA2 was introduced in 7z with the primary intent of
improving parallelization. It generally produces slightly worse
compression, but can be configured to guarantee that block N
is independent of blocks N-1 ... N-k for a specified k, meaning that
k+1 blocks can always be processed in parallel.
I believe that gzip has constraints similar to LZMA's that make
parallelization problematic, but I'm not sure about that.
Getting back to Wikimedia, it appears correct that the Wikistats code
is designed to run from the compressed files (source linked from [1]).
As you suggest, one could use the properties of .bz2 format to
parallelize that. I would also observe that parsers tend to be
relatively slow, while decompressors tend to be relatively fast. I
wouldn't necessarily assume that the decompressing is the only
bottleneck. I've run analyses on dumps that took longer to execute
than the decompression itself. However, they probably weren't many
times slower (i.e. if the processing were parallelized into 2 to 4
simultaneous chunks, the decompression would become the primary
bottleneck again).
So it is probably true that if one wants to see a large increase in
the speed of stats processing one needs to consider parallelizing both
the decompression and the stats gathering.
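The whole pipeline is easy to sketch for the bzip2 case: give each
worker its own independently decompressible piece, let it decompress
and tally locally, then sum the partial results (the break/process/sum
scheme mentioned at the top of the thread). A toy sketch, assuming the
dump has already been cut into per-chunk bzip2 streams; the line format
and function names are made up for illustration:

```python
import bz2
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def make_chunk_streams(lines, chunk_size):
    """Compress every chunk_size lines as its own bzip2 stream."""
    return [bz2.compress(b"".join(lines[i:i + chunk_size]))
            for i in range(0, len(lines), chunk_size)]

def tally(stream):
    """Decompress one chunk and gather per-page line counts."""
    counts = Counter()
    for line in bz2.decompress(stream).splitlines():
        counts[line.split(b":", 1)[0]] += 1
    return counts

# Toy "dump": page:revision lines, three pages, 1000 lines each.
lines = [b"page%d:rev\n" % (i % 3) for i in range(3000)]
streams = make_chunk_streams(lines, 500)

# Decompression and stats both run per chunk; CPython's bz2 releases
# the GIL, so threads overlap the decompression work. Partial counts
# from the workers then sum into one result.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(tally, streams), Counter())
print(total[b"page0"])  # 1000
```

With processes instead of threads the stats-gathering itself would
also run in parallel, which is the point above: both stages need it
for a large overall speedup.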
-Robert Rohde
[1]
http://stats.wikimedia.org/index_tabbed_new.html#fragment-14