Hi,
after reading the following sections:
http://wikitech.wikimedia.org/view/Data_dump_redesign#Follow_up
http://en.wikipedia.org/wiki/Wikipedia_database#Dealing_with_compressed_file...
http://meta.wikimedia.org/wiki/Data_dumps#bzip2
http://www.mediawiki.org/wiki/Mwdumper#Usage
http://www.mediawiki.org/wiki/Dbzip2#Development_status
and skimming the January, February and March archives of this year (all of which may be outdated and/or incomplete, in which case I'll sound like an idiot), I'd like to say the following:
** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's MO is to compress input blocks independently and then bit-shift the resulting compressed blocks (that is, single-block bzip2 streams) back into a single multi-block bzip2 stream, so that the resulting file is bit-identical to what a plain bzip2 run would produce, then the export process wastes CPU time. Bunzip2 can decompress concatenated bzip2 streams, so in exchange for a small size penalty, the dumper could simply concatenate the single-block streams and save a lot of cycles; see the example below.
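To illustrate (the chunk file names are made up; the behaviour is that of stock bzip2):

bzip2 -c chunk1.xml  > dump.bz2     # first stream
bzip2 -c chunk2.xml >> dump.bz2     # second stream, simply appended
bunzip2 -c dump.bz2                 # emits chunk1.xml, then chunk2.xml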
** 2. If dump.bz2 were single-block, many-stream (as opposed to the current many-block, single-stream), then people on the importing end could speed up *decompression* with pbzip2, as sketched below.
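Something along these lines (a sketch; -p4 is just an example worker count, pbzip2 only decompresses in parallel when the file consists of multiple streams -- which is exactly what the proposed format would give it -- and the importDump.php part just stands for whatever consumes the XML):

pbzip2 -dc -p4 dump.bz2 | php importDump.php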
** 3. Even if dump.bz2 stays single-stream, *or* it becomes multi-stream *but* is available only from a pipe or socket, decompression can still be sped up by way of lbzip2 (which I wrote, and am promoting here). Since it's written in strict adherence to the Single UNIX Specification, Version 2, it's available on Cygwin too, and should work on the Mac.
Depending on the circumstances (number of cores, whether dump.bz2 is available as a regular file or only from a pipe, etc.), different bunzip2 implementations perform best. For example, on my dual-core desktop, even
7za e -tbzip2 -so dump.bz2
performs best in some cases (I guess 7za parallelizes the different stages of the decompression).
For my more complete analysis (with explicit points on dbzip2, as I imagine it works), please see
http://lists.debian.org/debian-mentors/2009/02/msg00135.html
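For reference, a typical lbzip2 decompression pipeline might look like this (a sketch; -n selects the number of worker threads, 2 matching my dual-core desktop, and again importDump.php just stands for the consumer of the decompressed XML):

lbzip2 -d -n 2 < dump.bz2 > dump.xml                 # regular file
cat dump.bz2 | lbzip2 -d -n 2 | php importDump.php   # works from a pipe, too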
** 4. Thanassis Tsiodras' offline reader, available at
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to the section "Seeking in the dump file", bzip2recover to split the bzip2 blocks out of the single bzip2 stream. The page states:
This process is fast (since it involves almost no CPU calculations
While this may be true relative to other dump-processing operations, bzip2recover is, in fact, not much more than a huge single-threaded bit-shifter, which even makes two passes over the dump. (IIRC, the first pass shifts over the whole dump to find bzip2 block delimiters, then the second pass shifts the blocks found in the first pass into byte-aligned, separate bzip2 streams.)
Since lbzip2's multiple-worker decompressor distributes the search for bzip2 block headers over all cores, a list of bzip2 block bit positions (or the separate files themselves) could be created faster by hacking a bit on lbzip2 (as in "print positions, omit decompression").
Alternatively, dbzip2 itself could enable efficient seeking in the compressed dump by saving named bit positions to a separate text file.
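(For completeness, this is roughly what the bzip2recover-based splitting looks like on the command line; the exact rec* file names depend on the bzip2 version, and 00042 is just an example block:)

bzip2recover dump.bz2           # writes rec00001dump.bz2, rec00002dump.bz2, ...
bunzip2 -c rec00042dump.bz2     # any single extracted block decompresses on its own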
-o-
My purpose with this mail is two-fold:
- To promote lbzip2. I honestly believe it can help dump importers. I'm also promoting, with obviously less bias, pbzip2 and 7za, because in some decompression situations they beat lbzip2, and I feel their usefulness isn't emphasized enough in the links above. (If parallel decompression for importDump.php and/or MWDumper is a widely solved problem, then I'm sorry for the noise.)
- To ask a question. Can someone please describe the current (and planned) way of compressing/decompressing the dump? (If I'd had more recent info on this, perhaps I wouldn't have bothered the list with this post. I'm also just plain curious.)
Thanks, lacos
http://phptest11.atw.hu/
http://lacos.web.elte.hu/pub/lbzip2/