On 28-01-2012, Saturday, at 08:34 +0100, Federico Leva
(Nemo) wrote:
Richard Jelinek, 28/01/2012 00:38:
I don't know if this issue came up already - in case it did and has
been dismissed, I beg your pardon. In case it didn't...
There's a quite old comparison here:
https://www.mediawiki.org/wiki/Dbzip2 but
https://wikitech.wikimedia.org/view/Dumps/Parallelization suggests it's
still relevant.
I need to revisit this issue and look at it carefully at some point. I
will say for now that by running multiple workers on one host, we
already make use of multiple CPUs for the small and medium wikis. For
en wikipedia, we produce multiple pieces at once for each phase that
matters; this includes production of the gzipped stub files, so once
again in that case we are making maximum use of the CPUs and memory
on the host that runs those jobs.
It's possible that we could recombine the enwiki dumps into a single
file by using pbzip2, this being where compression using one cpu would
slow us down, but right now we just skip that step. People have been
fine with using the smaller files and in fact seem to prefer them.
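Recombination along these lines works because bzip2 streams can simply
be concatenated byte for byte; a conforming decompressor reads them
back to back as one logical file. A small sketch using Python's bz2
module (rather than pbzip2 itself) shows the property:

```python
import bz2

# Two independently compressed streams, as a parallel compressor such
# as pbzip2 would produce (one stream per input block).
part1 = bz2.compress(b"first half of the dump\n")
part2 = bz2.compress(b"second half of the dump\n")

combined = part1 + part2  # plain byte concatenation, no re-framing

# Decompressing the concatenation yields both halves in order.
result = bz2.decompress(combined)
assert result == b"first half of the dump\nsecond half of the dump\n"
```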
The other thing about switching from one bzip2 implementation to another
is that I rely on some specific properties of the bzip2 output (and its
library) for integrity checking and for locating blocks in the middle of
a dump when needed. I'd need to make sure my hacks still worked with the
new output.
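For context, one such property is that every bzip2 block begins with
the 48-bit magic 0x314159265359, which makes block starts findable in
the middle of a stream even though blocks are bit-aligned rather than
byte-aligned. A toy sketch of that scan (not the actual dump tooling;
real code would verify each candidate, since the bit pattern can also
occur by chance inside compressed data):

```python
import bz2

BLOCK_MAGIC = "{:048b}".format(0x314159265359)  # bzip2 block header magic

def block_bit_offsets(data):
    """Return bit offsets of candidate block starts in a bzip2 stream.

    Blocks are bit-aligned, so we scan a bit string rather than bytes.
    """
    bits = "".join("{:08b}".format(byte) for byte in data)
    offsets, pos = [], bits.find(BLOCK_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = bits.find(BLOCK_MAGIC, pos + 1)
    return offsets

data = bz2.compress(b"hello dumps\n" * 100)
# The first block magic sits just past the 4-byte "BZh9" stream header,
# i.e. at bit offset 32.
print(block_bit_offsets(data))
```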
Ariel