Hi,
after reading the following sections:
http://wikitech.wikimedia.org/view/Data_dump_redesign#Follow_up http://en.wikipedia.org/wiki/Wikipedia_database#Dealing_with_compressed_file... http://meta.wikimedia.org/wiki/Data_dumps#bzip2 http://www.mediawiki.org/wiki/Mwdumper#Usage http://www.mediawiki.org/wiki/Dbzip2#Development_status
and skimming the January, February and March archives of this year (all of which may be outdated and/or incomplete, in which case I'll sound like an idiot), I'd like to say the following:
** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's MO is to compress input blocks independently and then bit-shift the resulting compressed blocks (= single-block bzip2 streams) back into a single multi-block bzip2 stream, so that the resulting file is bit-identical to what bzip2 would produce, then the export process wastes (CPU) time. Bunzip2 can decompress concatenated bzip2 streams. In exchange for a small size penalty, the dumper could just concatenate the single-block bzip2 streams, saving a lot of cycles.
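For illustration, a minimal sketch (in Python, and assuming for brevity that the whole file fits in memory) of the loop a decompressor needs in order to handle such a concatenation -- restart on the unused data left over at each stream boundary:

    import bz2
    import sys

    def decompress_multistream(path, out=sys.stdout.buffer):
        # A real importer would decompress incrementally instead of
        # slurping the whole dump into memory.
        with open(path, 'rb') as f:
            data = f.read()
        while data:
            d = bz2.BZ2Decompressor()
            out.write(d.decompress(data))
            # unused_data holds everything past the stream that just
            # ended, i.e. the next concatenated stream (or nothing).
            data = d.unused_data

    decompress_multistream('dump.bz2')

The reference bunzip2 already does this, as noted above; the sketch is only meant to show how little the "small size penalty" format demands of other tools.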
** 2. If dump.bz2 was single-block, many-stream (as opposed to the current many-block, single-stream), then people on the importing end could speed up *decompression* with pbzip2.
** 3. Even if dump.bz2 stays single-stream, *or* it becomes multi-stream *but* is available only from a pipe or socket, decompression can still be sped up by way of lbzip2 (which I wrote, and am promoting here). Since it's written in strict adherence to the Single UNIX Specification, Version 2, it's available on Cygwin too, and should work on the Mac.
Depending on the circumstances (number of cores, availability of dump.bz2 as a regular file or just a pipe, etc.), different bunzip2 implementations do best. For example, on my dual core desktop, even
7za e -tbzip2 -so dump.bz2
performs best in some cases (7za -- I guess -- parallelizes the different stages of the decompression).
For my more complete analysis (with explicit points on (my imagination of) dbzip2), please see
http://lists.debian.org/debian-mentors/2009/02/msg00135.html
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to section "Seeking in the dump file", bzip2recover to split the bzip2 blocks out of the single bzip2 stream. The page states
This process is fast (since it involves almost no CPU calculations).
While this may be true relative to other dump-processing operations, bzip2recover is, in fact, not much more than a huge single-threaded bit-shifter, which even makes two passes over the dump. (IIRC, the first pass shifts over the whole dump to find bzip2 block delimiters, then the second pass shifts the blocks found previously into byte-aligned, separate bzip2 streams.)
Since lbzip2's multiple-workers decompressor distributes the search for bzip2 block headers over all cores, a list of bzip2 block bit positions (or the separate files themselves) could be created faster, by hacking a bit on lbzip2 (as in "print positions, omit decompression").
Or dbzip2 itself could enable efficient seeking in the compressed dump by saving named bit positions in a separate text file.
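To sketch the "print positions, omit decompression" idea (single-threaded here for simplicity; lbzip2 distributes the same search over all cores), the first pass boils down to sliding a 48-bit window over the dump and reporting every offset where it equals the block magic:

    import sys

    BLOCK_MAGIC = 0x314159265359   # pi in BCD; starts every bzip2 block
    MASK = (1 << 48) - 1

    def block_bit_positions(path):
        # Shift the file through a 48-bit window one bit at a time. A false
        # positive (the magic occurring by chance inside compressed data) is
        # theoretically possible, so a robust tool would verify candidates
        # by decompressing them speculatively.
        window = 0
        bitpos = 0
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(1 << 20)
                if not chunk:
                    return
                for byte in chunk:
                    for i in range(7, -1, -1):
                        window = ((window << 1) | ((byte >> i) & 1)) & MASK
                        bitpos += 1
                        if window == BLOCK_MAGIC:
                            yield bitpos - 48   # the magic began 48 bits back

    for pos in block_bit_positions(sys.argv[1]):
        print(pos)

(Bit-at-a-time Python is of course far too slow for a real dump; the point is only how simple the scan itself is.)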
-o-
My purpose with this mail is two-fold:
- To promote lbzip2. I honestly believe it can help dump importers. I'm also promoting, with obviously less bias, pbzip2 and 7za, because in some decompression situations they beat lbzip2, and I feel their usefulness isn't emphasized enough in the links above. (If parallel decompression for importDump.php and/or MWDumper is a widely solved problem, then I'm sorry for the noise.)
- To ask a question. Can someone please describe the current (and planned) way of compressing/decompressing the dump? (If I'd had more recent info on this, perhaps I wouldn't have bothered the list with this post. I'm also just plain curious.)
Thanks, lacos
http://phptest11.atw.hu/ http://lacos.web.elte.hu/pub/lbzip2/
ERSEK Laszlo wrote:
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to section "Seeking in the dump file", bzip2recover to split the bzip2 blocks out of the single bzip2 stream. The page states
This process is fast (since it involves almost no CPU calculations).
While this may be true relative to other dump-processing operations, bzip2recover is, in fact, not much more than a huge single-threaded bit-shifter, which even makes two passes over the dump. (IIRC, the first pass shifts over the whole dump to find bzip2 block delimiters, then the second pass shifts the blocks found previously into byte-aligned, separate bzip2 streams.)
Hmm? Admittedly, I don't know the bzip2 format very well, but as far as I understand it, there should be no bit-shifting involved: each block in the stream is a completely independent, self-contained sequence of bytes.
On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen nospam@vyznev.net wrote:
ERSEK Laszlo wrote:
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to section "Seeking in the dump file", bzip2recover to split the bzip2 blocks out of the single bzip2 stream. The page states
This process is fast (since it involves almost no CPU calculations).
While this may be true relative to other dump-processing operations, bzip2recover is, in fact, not much more than a huge single-threaded bit-shifter, which even makes two passes over the dump. (IIRC, the first pass shifts over the whole dump to find bzip2 block delimiters, then the second pass shifts the blocks found previously into byte-aligned, separate bzip2 streams.)
Hmm? Admittedly, I don't know the bzip2 format very well, but as far as I understand it, there should be no bit-shifting involved: each block in the stream is a completely independent, self-contained sequence of bytes.
I believe the point is that each block is a self-contained sequence of bits not bytes, so a block can terminate in the middle of a byte. The next block is appended immediately (if I understand correctly), so block boundaries do not necessarily align to byte boundaries. Hence the need to do bit shifting.
-Robert Rohde
Robert Rohde wrote:
On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen nospam@vyznev.net wrote:
Hmm? Admittedly, I don't know the bzip2 format very well, but as far as I understand it, there should be no bit-shifting involved: each block in the stream is a completely independent, self-contained sequence of bytes.
I believe the point is that each block is a self-contained sequence of bits not bytes, so a block can terminate in the middle of a byte. The next block is appended immediately (if I understand correctly), so block boundaries do not necessarily align to byte boundaries. Hence the need to do bit shifting.
The Wikipedia article (what else?) on the format says the blocks are padded to byte boundaries, and some quick testing seems to support that.
Ilmari Karonen wrote:
Robert Rohde wrote:
On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen nospam@vyznev.net wrote:
Hmm? Admittedly, I don't know the bzip2 format very well, but as far as I understand it, there should be no bit-shifting involved: each block in the stream is a completely independent, self-contained sequence of bytes.
I believe the point is that each block is a self-contained sequence of bits not bytes, so a block can terminate in the middle of a byte. The next block is appended immediately (if I understand correctly), so block boundaries do not necessarily align to byte boundaries. Hence the need to do bit shifting.
The Wikipedia article (what else?) on the format says the blocks are padded to byte boundaries, and some quick testing seems to support that.
I'm not so sure about that. I first tried to flush blocks instead of starting a new bzip2 file, but couldn't read the result back. bzip2recover's output suggested the blocks weren't on byte boundaries (the bit positions weren't multiples of 8).
On 03/26/09 20:30, Ilmari Karonen wrote:
The Wikipedia article (what else?) on the format says the blocks are padded to byte boundaries, and some quick testing seems to support that.
http://en.wikipedia.org/wiki/Bzip2#File_format
The compressed blocks are bit-aligned and no padding occurs.
The bzip2 stream is padded.
decompressible ::= stream | decompressible decompressible
stream ::= stream_header block* stream_footer
stream_header ::= STREAM_START_MAGIC VERSION BLOCKSIZE
block ::= BLOCK_START_MAGIC BLOCK_CRC BLOCK_DATA
stream_footer ::= STREAM_END_MAGIC COMBINED_STREAM_CRC STREAM_PADDING
Bunzip2 decompresses "decompressible", while bzip2 creates (in a single run) one "stream". A "decompressible" can be formed by concatenating two other "decompressible"s. In a "decompressible", the meat of any given bzip2 block starts at BLOCK_START_MAGIC, and terminates right before the next BLOCK_START_MAGIC or STREAM_END_MAGIC, whichever comes first.
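A quick sanity check of the grammar, using the well-known magic values (stream header "BZh" plus a block-size digit, 48-bit block magic 0x314159265359):

    import bz2

    data = bz2.compress(b"hello, world\n")
    assert data[:3] == b"BZh"          # STREAM_START_MAGIC + VERSION
    assert data[3:4] in b"123456789"   # BLOCKSIZE, in units of 100 kB
    assert data[4:10] == bytes.fromhex("314159265359")  # BLOCK_START_MAGIC
    # Only this first BLOCK_START_MAGIC is guaranteed byte-aligned (it sits
    # right after the 4-byte stream header); later block magics and the
    # STREAM_END_MAGIC generally are not, as discussed below.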
lacos
On 3/26/09 12:30 PM, Ilmari Karonen wrote:
Robert Rohde wrote:
On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen nospam@vyznev.net wrote:
Hmm? Admittedly, I don't know the bzip2 format very well, but as far as I understand it, there should be no bit-shifting involved: each block in the stream is a completely independent, self-contained sequence of bytes.
I believe the point is that each block is a self-contained sequence of bits not bytes, so a block can terminate in the middle of a byte. The next block is appended immediately (if I understand correctly), so block boundaries do not necessarily align to byte boundaries. Hence the need to do bit shifting.
The Wikipedia article (what else?) on the format says the blocks are padded to byte boundaries, and some quick testing seems to support that.
That is a filthy lie. :)
There is indeed no byte padding between blocks; it made my implementation of a parallel bzip2 compressor much harder and I never got round to finishing the decompressor.
-- brion
Brion Vibber wrote:
On 3/26/09 12:30 PM, Ilmari Karonen wrote:
The Wikipedia article (what else?) on the format says the blocks are padded to byte boundaries, and some quick testing seems to support that.
That is a filthy lie. :)
There is indeed no byte padding between blocks; it made my implementation of a parallel bzip2 compressor much harder and I never got round to finishing the decompressor.
You're right, I misread it. Only the whole stream is padded. :(
(In my defense, that seems like such a moronic design choice that I couldn't believe it could be true. If you're going to waste 48 bits per block on pi-in-BCD anyway, it seems silly to skimp on the 4 bits per block that'd be needed on average to pad to a byte boundary.)
On 03/27/09 02:21, Ilmari Karonen wrote:
Brion Vibber wrote:
On 3/26/09 12:30 PM, Ilmari Karonen wrote:
The Wikipedia article (what else?) on the format says the blocks are padded to byte boundaries, and some quick testing seems to support that.
That is a filthy lie. :)
There is indeed no byte padding between blocks; it made my implementation of a parallel bzip2 compressor much harder and I never got round to finishing the decompressor.
You're right, I misread it. Only the whole stream is padded. :(
(In my defense, that seems like such a moronic design choice that I couldn't believe it could be true. If you're going to waste 48 bits per block on pi-in-BCD anyway, it seems silly to skimp on the 4 bits per block that'd be needed on average to pad to a byte boundary.)
http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#limits
"Much of this complexity could have been avoided if the compressed size of each block of data was recorded in the data stream."
lacos
Srsly.
-- brion vibber (brion @ wikimedia.org)
On Mar 26, 2009, at 18:21, Ilmari Karonen nospam@vyznev.net wrote:
Brion Vibber wrote:
On 3/26/09 12:30 PM, Ilmari Karonen wrote:
The Wikipedia article (what else?) on the format says the blocks are padded to byte boundaries, and some quick testing seems to support that.
That is a filthy lie. :)
There is indeed no byte padding between blocks; it made my implementation of a parallel bzip2 compressor much harder and I never got round to finishing the decompressor.
You're right, I misread it. Only the whole stream is padded. :(
(In my defense, that seems like such a moronic design choice that I couldn't believe it could be true. If you're going to waste 48 bits per block on pi-in-BCD anyway, it seems silly to skimp on the 4 bits per block that'd be needed on average to pad to a byte boundary.)
-- Ilmari Karonen
On Thu, Mar 26, 2009 at 3:09 PM, Ilmari Karonen nospam@vyznev.net wrote:
ERSEK Laszlo wrote:
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to section "Seeking in the dump file", bzip2recover to split the bzip2 blocks out of the single bzip2 stream. The page states
This process is fast (since it involves almost no CPU calculations).
While this may be true relative to other dump-processing operations, bzip2recover is, in fact, not much more than a huge single-threaded bit-shifter, which even makes two passes over the dump. (IIRC, the first pass shifts over the whole dump to find bzip2 block delimiters, then the second pass shifts the blocks found previously into byte-aligned, separate bzip2 streams.)
Hmm? Admittedly, I don't know the bzip2 format very well, but as far as I understand it, there should be no bit-shifting involved: each block in the stream is a completely independent, self-contained sequence of bytes.
I'm not sure about the bit-shifting, but the second pass of bzip2recover adds in the file headers.
The thing is, the second pass of bzip2recover is unnecessary if all you want to do is build a file index, and the source code of the first pass is really simple. I've managed to hack it to make a single pass that outputs the starting byte of each block. With that index I can dd out the block I want to access (plus a few bytes of padding on each end, because I was too lazy to find the exact location), run the real bzip2recover on just that slice, and then uncompress the recovered block. It could be made into a lot less of a hack if I took the time to figure out how exactly to rebuild the header, but I haven't bothered.
Anyway, the point being that it's not necessary to actually run bzip2recover and deal with millions of little files, if you build an index. That way you don't need twice the space, and don't need to mess with reiserfs.
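For what it's worth, rebuilding the header is not much code either. A hypothetical sketch (all names mine) of carving one block out of the dump and rewrapping it as a self-contained single-block stream, much as bzip2recover's second pass does; start_bit and end_bit would be two consecutive entries from a block index like the one sketched earlier (for the final block, end_bit is the bit position of the 48-bit stream-end magic instead):

    import bz2

    def read_bits(data, start, n):
        # Return n bits of 'data' (bytes) beginning at bit offset 'start'.
        val = 0
        for pos in range(start, start + n):
            val = (val << 1) | ((data[pos >> 3] >> (7 - (pos & 7))) & 1)
        return val

    class BitWriter:
        def __init__(self):
            self.out, self.acc, self.nbits = bytearray(), 0, 0
        def write(self, val, n):
            for i in range(n - 1, -1, -1):
                self.acc = (self.acc << 1) | ((val >> i) & 1)
                self.nbits += 1
                if self.nbits == 8:
                    self.out.append(self.acc)
                    self.acc = self.nbits = 0
        def pad(self):   # zero-fill up to the next byte boundary
            if self.nbits:
                self.out.append(self.acc << (8 - self.nbits))
                self.acc = self.nbits = 0

    def carve_block(data, start_bit, end_bit, level=b"9"):
        w = BitWriter()
        for ch in b"BZh" + level:   # stream header; declaring '9' is always
            w.write(ch, 8)          # safe, a smaller block still fits
        # The 32 bits after the 48-bit block magic are the block CRC.
        block_crc = read_bits(data, start_bit + 48, 32)
        pos = start_bit             # copy magic + CRC + block data verbatim
        while pos < end_bit:
            n = min(24, end_bit - pos)
            w.write(read_bits(data, pos, n), n)
            pos += n
        w.write(0x177245385090, 48) # STREAM_END_MAGIC (sqrt(pi) in BCD)
        w.write(block_crc, 32)      # combined CRC == block CRC for one block
        w.pad()
        return bytes(w.out)

With that, bz2.decompress(carve_block(data, s, e)) yields one block's worth of pages without touching the rest of the dump.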
On 3/26/09 10:58 AM, ERSEK Laszlo wrote:
** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's MO is to compress input blocks independently and then bit-shift the resulting compressed blocks (= single-block bzip2 streams) back into a single multi-block bzip2 stream, so that the resulting file is bit-identical to what bzip2 would produce, then the export process wastes (CPU) time. Bunzip2 can decompress concatenated bzip2 streams. In exchange for a small size penalty, the dumper could just concatenate the single-block bzip2 streams, saving a lot of cycles.
It's been years since I poked it seriously so I don't recall any exact figures, but I doubt it's very many cycles, and mass bit-shifting is likely trivial to optimize should anyone feel it necessary.
More importantly, not every decompressor will decompress concatenated streams. Dictating which decoder end-users should use is not cool. :)
** 2. If dump.bz2 was single-block, many-stream (as opposed to the current many-block, single-stream), then people on the importing end could speed up *decompression* with pbzip2.
Lack of compatibility with other tools makes this format undesirable; further note that a smarter decompressor could act as bzip2recover does to estimate block boundaries and decompress them speculatively. In the rare case of an incorrect match, you've only lost one to two blocks' worth of time.
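As a sketch of that speculative strategy, reusing carve_block from the earlier message (the module name below is made up):

    import bz2
    from bz2_blocks import carve_block  # hypothetical home of the earlier sketch

    def try_block(data, start_bit, end_bit):
        # Treat every magic match as a block start; a false positive simply
        # fails to decompress and is discarded, costing one block's work.
        try:
            return bz2.decompress(carve_block(data, start_bit, end_bit))
        except (OSError, ValueError):
            return None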
I never got round to completing the decompressor implementation for dbzip2, though.
** 3. Even if dump.bz2 stays single-stream, *or* it becomes multi-stream *but* is available only from a pipe or socket, decompression can still be sped up by way of lbzip2 (which I wrote, and am promoting here). Since it's written in strict adherence to the Single UNIX Specification, Version 2, it's available on Cygwin too, and should work on the Mac.
Awesome!
-- brion
Brion Vibber wrote:
More importantly, not every decompressor will decompress concatenated streams. Dictating which decoder end-users should use is not cool. :)
The reference bzip2 tool has supported it for ages. I added support for concatenated bzip2 files to php bz2 in September; it is only supported on the newer php versions. (Oddly, importDump.php doesn't seem to support bzipped dumps.) Don't know about java/mwdumper support.
** 2. If dump.bz2 was single-block, many-stream (as opposed to the current many-block, single-stream), then people on the importing end could speed up *decompression* with pbzip2.
Lack of compatibility with other tools makes this format undesirable; further note that a smarter decompressor could act as bzip2recover does to estimate block boundaries and decompress them speculatively. In the rare case of an incorrect match, you've only lost one to two blocks' worth of time.
Supporting streams from those other tools adds quite a bit of complexity for decompressors that want to decompress only a single block (due to the byte-unaligned nature of blocks).
I never got round to completing the decompressor implementation for dbzip2, though.
The code at http://svn.wikimedia.org/viewvc/mediawiki/trunk/dbzip2/ ?
On 3/27/09 4:54 PM, Keisial wrote:
I never got round to completing the decompressor implementation for dbzip2, though.
The code at http://svn.wikimedia.org/viewvc/mediawiki/trunk/dbzip2/ ?
Yep. Compressor works (local and network-distributed nodes), decompressor never got finished. :)
-- brion