On Sat, 28-01-2012, at 08:34 +0100, Federico Leva (Nemo) wrote:
Richard Jelinek, 28/01/2012 00:38:
I don't know if this issue came up already; in case it did and has been dismissed, I beg your pardon. In case it didn't...
There's a quite old comparison here: https://www.mediawiki.org/wiki/Dbzip2 but https://wikitech.wikimedia.org/view/Dumps/Parallelization suggests it's still relevant.
I need to revisit this issue and look at it carefully at some point. I will say for now that by running multiple workers on one host, we already make use of multiple CPUs for the small and medium wikis. For en wikipedia, we produce multiple pieces at once for each phase that matters, including production of the gzipped stub files, so in that case too we already make full use of the CPUs and memory on the host that runs them.
We could recombine the enwiki dump pieces into a single file using pbzip2; that recombination is the step where compression on a single CPU would slow us down. Right now we simply skip it: people have been fine with using the smaller files and in fact seem to prefer them.
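For the curious, the pbzip2 idea boils down to compressing chunks independently on several CPUs and concatenating the resulting bzip2 streams (a concatenation of streams is still a valid .bz2 file). Here is a minimal Python sketch of that approach, not our actual code; the file names and chunk size are made up:

import bz2
import concurrent.futures

CHUNK_SIZE = 64 * 1024 * 1024  # illustrative chunk size

def read_chunks(path, size=CHUNK_SIZE):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                break
            yield chunk

def parallel_bzip2(src, dst, workers=4):
    """Compress src to dst pbzip2-style, using several CPUs."""
    with concurrent.futures.ProcessPoolExecutor(workers) as pool, \
            open(dst, "wb") as out:
        # map() keeps the chunks in order, so the output is a plain
        # concatenation of independent bzip2 streams.  Note that map()
        # pulls in the whole iterable up front; a real tool would bound
        # the number of in-flight chunks.
        for compressed in pool.map(bz2.compress, read_chunks(src)):
            out.write(compressed)

if __name__ == "__main__":
    parallel_bzip2("enwiki-pages.xml", "enwiki-pages.xml.bz2")

Standard bzip2 (and Python's bz2 module) will decompress the result, since each stream is simply decoded in turn.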
The other thing about switching from one bzip2 implementation to another is that I rely on some specific properties of the bzip2 output (and of its library) for integrity checking and for locating blocks in the middle of a dump when needed, so I'd need to make sure my hacks still work with the new output.
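For anyone wondering what locating blocks means: bzip2 starts each compressed block with a 48-bit magic number, 0x314159265359, so you can seek to an arbitrary offset in a dump and scan forward bit by bit for the next block boundary. A toy Python sketch of the idea follows; it is nothing like my actual scripts, and the file name is made up:

BLOCK_MAGIC = 0x314159265359  # every bzip2 block header starts with this

def find_block_offsets(path):
    """Yield (byte_offset, bit_offset) of each block-header magic found."""
    window = 0
    mask = (1 << 48) - 1
    bitpos = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(1)
            if not data:
                return
            for i in range(8):
                bit = (data[0] >> (7 - i)) & 1
                window = ((window << 1) | bit) & mask
                bitpos += 1
                if bitpos >= 48 and window == BLOCK_MAGIC:
                    start = bitpos - 48
                    yield start // 8, start % 8

if __name__ == "__main__":
    for byte_off, bit_off in find_block_offsets("enwiki-pages.xml.bz2"):
        print("block header at byte %d, bit %d" % (byte_off, bit_off))

Bit-by-bit scanning in pure Python is slow, of course; the point is just that block boundaries are findable, which is what makes mid-dump seeking and integrity checks possible. That is the property I'd want to verify still holds with any replacement compressor.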
Ariel