On Tue, Oct 26, 2010 at 8:25 AM, Ariel T. Glenn <ariel@wikimedia.org> wrote:
On Tue, 26-10-2010 at 16:25 +0200, Platonides wrote:
Robert Rohde wrote:
Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel, and sum the results). You could talk to Erik Zachte. I don't know whether his code has been designed for parallel processing, though.
I don't think it's a good candidate, since you are presumably working from compressed files, and decompression serialises the whole job (and is most likely the bottleneck, too).
If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs, each decompressing only its small piece, then becomes feasible. I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.
Actually, the LZMA used by default in 7z can be partially parallelized, with some strong limitations:
1) The start of block N can generally be found only by locating the end of block N-1, so the file has to be read serially.
2) Decompressing block N may or may not require having already decompressed blocks N-1, N-2, N-3, etc., depending on the details of the data stream.
Point 2 in particular tends to create dependencies that prevent parallelization. If block N happens to be independent of block N-1, the two can be decompressed in parallel, but in general this will not be the case. The frequency of such dependencies depends a lot on the data stream and the options given to the compressor.
Last year LZMA2 was introduced in 7z with the primary intent of improving parallelization. It generally produces slightly worse compression, but it can be configured to guarantee that block N is independent of blocks N-1 ... N-k for a specified k, meaning that k+1 blocks can always be decompressed in parallel.
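For what it's worth, the same guarantee can be emulated with any compressor by cutting the input into fixed-size chunks and compressing each one with a fresh encoder state; it costs some compression (each chunk restarts with an empty dictionary, which is also why LZMA2's independent blocks compress slightly worse), but every chunk becomes decompressible on its own. A rough sketch in Python, using the standard lzma module rather than 7z's own container:

    import lzma

    CHUNK = 4 << 20  # 4 MiB per block; purely an illustrative choice

    def compress_independent(data, chunk=CHUNK):
        # Compress every chunk with a fresh encoder, so no block ever
        # references an earlier one (the guarantee LZMA2's independent
        # blocks provide inside a single archive).
        return [lzma.compress(data[i:i + chunk])
                for i in range(0, len(data), chunk)]

    # Any block now stands alone: lzma.decompress(blocks[7]) succeeds
    # even if blocks 0-6 were never read, so k workers can each grab a
    # block with no coordination.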
I believe that gzip has constraints similar to LZMA's that make parallelization problematic, but I'm not sure about that.
Getting back to Wikimedia, it appears correct that the Wikistats code is designed to run from the compressed files (source linked from [1]). As you suggest, one could use the properties of the .bz2 format to parallelize that (see the sketch below). I would also observe that parsers tend to be relatively slow, while decompressors tend to be relatively fast, so I wouldn't assume that decompression is the only bottleneck. I've run analyses on dumps that took longer to execute than it took to decompress the files. However, they probably didn't take that many times longer, i.e. if the processing were parallelized into 2 to 4 simultaneous chunks, decompression would become the primary bottleneck again.
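To make the .bz2 trick concrete: every bzip2 block starts with the same 48-bit magic number, but blocks are bit-aligned rather than byte-aligned, so finding candidate boundaries means sliding a 48-bit window through the file one bit at a time. A rough sketch of just the scanning step (actually decompressing from a hit also requires re-aligning the bits and synthesizing a stream header, which I'm glossing over, and the payload can contain the magic by chance, so hits need to be verified):

    import sys

    BLOCK_MAGIC = 0x314159265359  # 48-bit magic at the start of every bzip2 block
    MASK = (1 << 48) - 1

    def candidate_block_offsets(path):
        # Yield the bit offset of every 48-bit window matching the block
        # magic. Blocks are not byte-aligned, so we shift one bit at a time.
        window = 0
        bits_seen = 0
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):
                for byte in chunk:
                    for shift in range(7, -1, -1):
                        window = ((window << 1) | ((byte >> shift) & 1)) & MASK
                        bits_seen += 1
                        if bits_seen >= 48 and window == BLOCK_MAGIC:
                            yield bits_seen - 48

    if __name__ == "__main__":
        for bit_offset in candidate_block_offsets(sys.argv[1]):
            print(bit_offset // 8, bit_offset % 8)  # byte offset, bit within byte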
So it is probably true that if one wants to see a large increase in the speed of stats processing, one needs to parallelize both the decompression and the stats gathering.
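Putting those together, and assuming the dump has already been cut at block boundaries into separate .bz2 pieces, each worker can run its own decompressor feeding its own stats pass, with the per-piece results summed at the end. A rough sketch; count_revisions() here is just a stand-in for whatever pass Wikistats actually makes:

    import subprocess
    from multiprocessing import Pool

    def count_revisions(stream):
        # Hypothetical stand-in for the real stats pass: count
        # <revision> tags in the decompressed XML.
        return sum(line.count(b"<revision>") for line in stream)

    def process_piece(path):
        # One worker per piece: its own decompressor feeding its own
        # parser, so both stages run in parallel across pieces.
        proc = subprocess.Popen(["bzip2", "-dc", path], stdout=subprocess.PIPE)
        try:
            return count_revisions(proc.stdout)
        finally:
            proc.stdout.close()
            proc.wait()

    def run(pieces, workers=4):
        with Pool(workers) as pool:
            return sum(pool.map(process_piece, pieces))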
-Robert Rohde
[1] http://stats.wikimedia.org/index_tabbed_new.html#fragment-14