Hi,
after reading the following sections:
http://wikitech.wikimedia.org/view/Data_dump_redesign#Follow_up
http://en.wikipedia.org/wiki/Wikipedia_database#Dealing_with_compressed_fil…
http://meta.wikimedia.org/wiki/Data_dumps#bzip2
http://www.mediawiki.org/wiki/Mwdumper#Usage
http://www.mediawiki.org/wiki/Dbzip2#Development_status
and skimming the January, February and March archives of this year (all of
which may be outdated and/or incomplete, in which case I'll sound like an
idiot), I'd like to say the following:
** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's
MO is to compress input blocks independently and then bit-shift the
resulting compressed blocks (= single-block bzip2 streams) back into a
single multi-block bzip2 stream, so that the resulting file is
bit-identical to what bzip2 would produce, then the export process wastes
(CPU) time. Bunzip2 can decompress concatenated bzip2 streams, so in
exchange for a small size penalty, the dumper could simply concatenate the
single-block bzip2 streams, saving a lot of cycles.
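The concatenation property is easy to verify. A minimal sketch, using
Python's bz2 module as a stand-in for bunzip2 (it, too, accepts
multi-stream input):

```python
import bz2

# Two chunks compressed independently, each a complete bzip2 stream.
part1 = bz2.compress(b"first part of the dump ")
part2 = bz2.compress(b"second part of the dump")

# Simply concatenating the streams yields a valid .bz2 file:
# decompressors process the streams back to back.
combined = part1 + part2
assert bz2.decompress(combined) == b"first part of the dump second part of the dump"
```

The size penalty is just the per-stream header/trailer overhead (on the
order of a dozen bytes per stream), negligible against 900k-sized blocks.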
** 2. If dump.bz2 were single-block, many-stream (as opposed to the current
many-block, single-stream layout), then people on the importing end could
speed up *decompression* with pbzip2.
** 3. Even if dump.bz2 stays single-stream, *or* it becomes multi-stream
*but* is available only from a pipe or socket, decompression can still be
sped up with lbzip2 (which I wrote, and am promoting here). Since it's
written in strict adherence to the Single UNIX Specification, Version 2,
it's available on Cygwin too, and should work on the Mac.
Depending on the circumstances (number of cores, availability of dump.bz2
as a regular file or only from a pipe, etc.), different bunzip2
implementations perform best.
For example, on my dual core desktop, even
7za e -tbzip2 -so dump.bz2
performs best in some cases (presumably because it parallelizes the
different stages of the decompression).
For my more complete analysis (with explicit points on (my imagination of)
dbzip2), please see
http://lists.debian.org/debian-mentors/2009/02/msg00135.html
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to the section "Seeking in the dump file", bzip2recover to
split the bzip2 blocks out of the single bzip2 stream. The page states:
    "This process is fast (since it involves almost no CPU calculations)"
While this may be true relative to other dump-processing operations,
bzip2recover is, in fact, not much more than a huge single-threaded
bit-shifter, which even makes two passes over the dump. (IIRC, the first
pass shifts over the whole dump to find bzip2 block delimiters, then the
second pass shifts the blocks found previously into byte-aligned, separate
bzip2 streams.)
Since lbzip2's multiple-workers decompressor distributes the search for
bzip2 block headers over all cores, a list of bzip2 block bit positions
(or the separate files themselves) could be created faster with a small
hack to lbzip2 (as in "print positions, omit decompression").
Or dbzip2 itself could enable efficient seeking in the compressed dump by
saving named bit positions in a separate text file.
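For the curious, the bit-shifting search boils down to scanning every bit
offset for the 48-bit block-header magic number (0x314159265359). A
deliberately naive Python sketch of one pass (bzip2recover does this with
shift registers in C; lbzip2 would split the offset range across workers):

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit bzip2 block-header magic number
STREAM_HDR_BITS = 32          # "BZh" + compression level = 4 bytes

def block_bit_positions(data):
    """Return the bit offsets at which bzip2 block headers start.

    Naive single-threaded scan over every bit offset -- essentially
    bzip2recover's first pass. A parallel tool could give each core
    its own slice of the offset range.
    """
    nbits = len(data) * 8
    as_int = int.from_bytes(data, "big")  # whole file as one bit string
    hits = []
    for bit in range(nbits - 48 + 1):
        window = (as_int >> (nbits - 48 - bit)) & 0xFFFFFFFFFFFF
        if window == BLOCK_MAGIC:
            hits.append(bit)
    return hits

# In a fresh stream, the first block header immediately follows the
# 32-bit stream header, i.e. it sits at bit offset 32.
stream = bz2.compress(b"some dump text " * 50)
assert STREAM_HDR_BITS in block_bit_positions(stream)
```

(The big-integer trick keeps the sketch short; a real implementation
would stream the input through a shift register instead of holding the
whole dump in memory.)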
-o-
My purpose with this mail is two-fold:
- To promote lbzip2. I honestly believe it can help dump importers. I'm
also promoting, with obviously less bias, pbzip2 and 7za, because in some
decompression situations they beat lbzip2, and I feel their usefulness
isn't emphasized enough in the links above. (If parallel decompression for
importDump.php and/or MWDumper is already a solved problem, then I'm sorry
for the noise.)
- To ask a question. Can someone please describe the current (and planned)
way of compressing/decompressing the dump? (If I'd had more recent info on
this, perhaps I wouldn't have bothered the list with this post. I'm also
just plain curious.)
Thanks,
lacos
http://phptest11.atw.hu/
http://lacos.web.elte.hu/pub/lbzip2/