Hi,
after reading the following sections:
http://wikitech.wikimedia.org/view/Data_dump_redesign#Follow_up
http://en.wikipedia.org/wiki/Wikipedia_database#Dealing_with_compressed_fil…
http://meta.wikimedia.org/wiki/Data_dumps#bzip2
http://www.mediawiki.org/wiki/Mwdumper#Usage
http://www.mediawiki.org/wiki/Dbzip2#Development_status
and skimming the January, February and March archives of this year (all of
which may be outdated and/or incomplete, in which case I'll sound like an
idiot), I'd like to say the following:
** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's
MO is to compress input blocks independently and then bit-shift the
resulting compressed blocks (= single-block bzip2 streams) back into a
single multi-block bzip2 stream, so that the resulting file is
bit-identical to what bzip2 would produce, then the export process wastes
(CPU) time. Bunzip2 can decompress concatenated bzip2 streams, so in
exchange for a small size penalty, the dumper could simply concatenate the
single-block bzip2 streams, saving a lot of cycles.
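The concatenation property is easy to verify. A minimal sketch, using
Python's bz2 module as a stand-in for bunzip2 (it, too, accepts
multi-stream input):

```python
import bz2

# Two chunks compressed independently, each a complete bzip2 stream.
part1 = bz2.compress(b"first part of the dump ")
part2 = bz2.compress(b"second part of the dump")

# Simply concatenating the streams yields a valid .bz2 file:
# decompressors process the streams back to back.
combined = part1 + part2
assert bz2.decompress(combined) == b"first part of the dump second part of the dump"
```

The size penalty is just the per-stream header/trailer overhead (on the
order of a dozen bytes per stream), negligible against 900k-sized blocks.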
** 2. If dump.bz2 were single-block, many-stream (as opposed to the current
many-block, single-stream layout), then people on the importing end could
speed up *decompression* with pbzip2.
** 3. Even if dump.bz2 stays single-stream, *or* it becomes multi-stream
*but* is available only from a pipe or socket, decompression can still be
sped up with lbzip2 (which I wrote, and am promoting here). Since it's
written in strict adherence to the Single UNIX Specification, Version 2,
it's available on Cygwin too, and should work on the Mac.
Depending on the circumstances (number of cores, availability of dump.bz2
as a regular file or only from a pipe, etc.), different bunzip2
implementations perform best.
For example, on my dual core desktop, even
7za e -tbzip2 -so dump.bz2
performs best in some cases (presumably because it parallelizes the
different stages of the decompression).
For my more complete analysis (with explicit points on (my imagination of)
dbzip2), please see
http://lists.debian.org/debian-mentors/2009/02/msg00135.html
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to the section "Seeking in the dump file", bzip2recover to
split the bzip2 blocks out of the single bzip2 stream. The page states:
    "This process is fast (since it involves almost no CPU calculations)"
While this may be true relative to other dump-processing operations,
bzip2recover is, in fact, not much more than a huge single-threaded
bit-shifter, which even makes two passes over the dump. (IIRC, the first
pass shifts over the whole dump to find bzip2 block delimiters, then the
second pass shifts the blocks found previously into byte-aligned, separate
bzip2 streams.)
Since lbzip2's multiple-workers decompressor distributes the search for
bzip2 block headers over all cores, a list of bzip2 block bit positions
(or the separate files themselves) could be created faster with a small
hack to lbzip2 (as in "print positions, omit decompression").
Or dbzip2 itself could enable efficient seeking in the compressed dump by
saving named bit positions in a separate text file.
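For the curious, the bit-shifting search boils down to scanning every bit
offset for the 48-bit block-header magic number (0x314159265359). A
deliberately naive Python sketch of one pass (bzip2recover does this with
shift registers in C; lbzip2 would split the offset range across workers):

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit bzip2 block-header magic number
STREAM_HDR_BITS = 32          # "BZh" + compression level = 4 bytes

def block_bit_positions(data):
    """Return the bit offsets at which bzip2 block headers start.

    Naive single-threaded scan over every bit offset -- essentially
    bzip2recover's first pass. A parallel tool could give each core
    its own slice of the offset range.
    """
    nbits = len(data) * 8
    as_int = int.from_bytes(data, "big")  # whole file as one bit string
    hits = []
    for bit in range(nbits - 48 + 1):
        window = (as_int >> (nbits - 48 - bit)) & 0xFFFFFFFFFFFF
        if window == BLOCK_MAGIC:
            hits.append(bit)
    return hits

# In a fresh stream, the first block header immediately follows the
# 32-bit stream header, i.e. it sits at bit offset 32.
stream = bz2.compress(b"some dump text " * 50)
assert STREAM_HDR_BITS in block_bit_positions(stream)
```

(The big-integer trick keeps the sketch short; a real implementation
would stream the input through a shift register instead of holding the
whole dump in memory.)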
-o-
My purpose with this mail is two-fold:
- To promote lbzip2. I honestly believe it can help dump importers. I'm
also promoting, with obviously less bias, pbzip2 and 7za, because in some
decompression situations they beat lbzip2, and I feel their usefulness
isn't emphasized enough in the links above. (If parallel decompression for
importDump.php and/or MWDumper is already a solved problem, then I'm sorry
for the noise.)
- To ask a question. Can someone please describe the current (and planned)
way of compressing/decompressing the dump? (If I'd had more recent info on
this, perhaps I wouldn't have bothered the list with this post. I'm also
just plain curious.)
Thanks,
lacos
http://phptest11.atw.hu/
http://lacos.web.elte.hu/pub/lbzip2/