On 26-10-2010, Tuesday, at 16:25 +0200, Platonides wrote:
Robert Rohde wrote:
Many of the things done for the statistical
analysis of database dumps
should be suitable for parallelization (e.g. break the dump into
chunks, process the chunks in parallel and sum the results). You
could talk to Erik Zachte. I don't know if his code has already been
designed for parallel processing though.
I don't think it's a good candidate, since you are presumably working
from compressed files, and decompression serializes the work (and is
most likely the bottleneck, too).
If one were clever (and I have some code that would enable one to be
clever), one could seek to some point in the (bzip2-compressed) file and
uncompress from there before processing. Running a bunch of jobs, each
decompressing only its own small piece, then becomes feasible. I don't
have code that does this for gz or 7z; afaik these do not do compression
in discrete blocks.
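A minimal sketch of the idea, under two assumptions not confirmed above: the
dump is a concatenation of independent bzip2 streams (as in the multistream
dump files), and the byte offset and length of each stream are already known,
e.g. from an index. The function names `decompress_chunk` and `map_chunks`
are hypothetical, not from any existing code:

```python
# Sketch: parallel processing of a file made of independent bzip2 streams.
# Assumes each (offset, length) pair marks one self-contained stream.
import bz2
from concurrent.futures import ThreadPoolExecutor

def decompress_chunk(path, offset, length):
    """Read one self-contained bzip2 stream and decompress just it."""
    with open(path, "rb") as f:
        f.seek(offset)
        compressed = f.read(length)
    # A fresh decompressor per stream: each stream has its own header.
    return bz2.BZ2Decompressor().decompress(compressed)

def map_chunks(path, offsets, worker, max_workers=4):
    """Decompress each chunk in its own thread and apply `worker` to the
    uncompressed bytes. CPython's bz2 releases the GIL while decompressing,
    so the streams decompress in parallel. The caller sums the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(decompress_chunk, path, off, ln)
                   for off, ln in offsets]
        return [worker(f.result()) for f in futures]
```

For example, `sum(map_chunks(path, offsets, len))` would give the total
uncompressed size; a statistics pass would use a worker that parses its
chunk and returns partial counts to be summed.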
Ariel