On Tue, 26 Oct 2010 at 16:25 +0200, Platonides wrote:
Robert Rohde wrote:
Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g., break the dump into chunks, process the chunks in parallel, and sum the results). You could talk to Erik Zachte. I don't know whether his code has already been designed for parallel processing, though.
I don't think it's a good candidate, since you are presumably using compressed files, and decompression is inherently sequential (and is most likely the bottleneck, too).
If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and decompress from there before processing. Running a bunch of jobs, each decompressing only its own small piece, then becomes feasible. I don't have code that does this for gz or 7z; as far as I know, those formats don't compress in discrete blocks.
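To make the idea concrete, here is a minimal sketch (not Ariel's actual code) of the seek-and-decompress approach in Python. It assumes the dump is a multistream bzip2 file, i.e. a concatenation of independent bzip2 streams, as the Wikimedia multistream dumps are: each stream begins with the byte-aligned magic "BZh" plus a level digit plus the 48-bit block magic 0x314159265359, so scanning for that pattern yields offsets where independent decompression can start. The function names here (`find_stream_offsets`, `parallel_decompress`) are illustrative, not from any existing tool.

```python
import bz2
import re
from concurrent.futures import ThreadPoolExecutor

# Each independent bzip2 stream starts with "BZh" + compression level
# (1-9) followed immediately by the first block's magic bytes
# 0x314159265359.  This 10-byte pattern is byte-aligned only at stream
# boundaries; in principle it could also appear by chance inside
# compressed data, though that is astronomically unlikely.
STREAM_MAGIC = re.compile(rb"BZh[1-9]\x31\x41\x59\x26\x53\x59")

def find_stream_offsets(data):
    """Return byte offsets where an independent bzip2 stream begins."""
    return [m.start() for m in STREAM_MAGIC.finditer(data)]

def _decompress_slice(data, start, end):
    # Each slice is a complete bzip2 stream, decompressable on its own.
    return bz2.decompress(data[start:end])

def parallel_decompress(data, workers=4):
    """Decompress a multistream bzip2 buffer, one stream per task."""
    offsets = find_stream_offsets(data)
    bounds = list(zip(offsets, offsets[1:] + [len(data)]))
    # bz2 releases the GIL while decompressing, so threads suffice here;
    # a process pool would work the same way for separate jobs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pieces = pool.map(lambda b: _decompress_slice(data, *b), bounds)
        return b"".join(pieces)
```

In a real dump-processing setup one would seek within the file rather than load it into memory, hand each (offset, length) pair to a separate job, and process the decompressed text before summing the per-chunk results, as Robert suggests above.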
Ariel