On 3/26/09 10:58 AM, ERSEK Laszlo wrote:
** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's
MO is to compress input blocks independently and then bit-shift the
resulting compressed blocks (= single-block bzip2 streams) back into a
single multi-block bzip2 stream, so that the resulting file is
bit-identical to what bzip2 would produce, then the export process wastes
(CPU) time. Bunzip2 can decompress concatenated bzip2 streams. In exchange
for a small size penalty, the dumper could just concatenate the
single-block bzip2 streams, saving a lot of cycles.
It's been years since I poked at it seriously, so I don't recall any exact
figures, but I doubt it's very many cycles, and mass bit-shifting is
likely trivial to optimize should anyone feel it necessary.
More importantly, not every decompressor will decompress concatenated
streams. Dictating which decoder end-users should use is not cool. :)
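(For reference, the concatenation idea itself is easy to demonstrate with a
decompressor that does handle multiple streams; a minimal Python sketch, with
made-up chunk contents:)

```python
import bz2

# Two chunks compressed as independent bzip2 streams, then simply
# concatenated -- roughly what a dumper that skips the bit-shifting
# step would produce.
chunk_a = b"first block of dump data\n"
chunk_b = b"second block of dump data\n"
concatenated = bz2.compress(chunk_a) + bz2.compress(chunk_b)

# Like bunzip2, Python's bz2.decompress consumes all concatenated
# streams and yields the original data back-to-back.  Decompressors
# that stop after the first stream would return only chunk_a's data.
assert bz2.decompress(concatenated) == chunk_a + chunk_b
```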
** 2. If dump.bz2 were single-block, many-stream (as opposed to the current
many-block, single-stream), then people on the importing end could speed
up *decompression* with pbzip2.
Lack of compatibility with other tools makes this format undesirable;
further, note that a smarter decompressor could act as bzip2recover does,
estimating block boundaries and decompressing blocks speculatively. In the
rare case of an incorrect match, you've only lost one or two blocks' worth
of time.
I never got round to completing the decompressor implementation for
dbzip2, though.
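(The bzip2recover-style boundary scan looks roughly like this -- an
illustrative Python sketch, not dbzip2 code; a real decompressor would also
have to cope with the magic occurring by chance inside compressed data,
which is the speculative-match case above:)

```python
import bz2

# bzip2's 48-bit block-header magic (the digits of pi).
BLOCK_MAGIC = 0x314159265359

def find_block_starts(data: bytes):
    """Return bit offsets where the block-header magic occurs.

    Block headers are not byte-aligned, so every bit offset must
    be checked.
    """
    bits = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    hits = []
    for off in range(total_bits - 48 + 1):
        # The 48-bit window starting `off` bits from the left edge.
        window = (bits >> (total_bits - 48 - off)) & ((1 << 48) - 1)
        if window == BLOCK_MAGIC:
            hits.append(off)
    return hits

# In a real .bz2 file the stream header ("BZh" plus a level digit)
# is 4 bytes, so the first block header sits at bit offset 32.
print(find_block_starts(bz2.compress(b"hello, dump")))
```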
** 3. Even if dump.bz2 stays single-stream, *or* it becomes multi-stream
*but* is available only from a pipe or socket, decompression can still be
sped up by way of lbzip2 (which I wrote, and am promoting here). Since
it's written in strict adherence to the Single UNIX Specification, Version
2, it's available on Cygwin too, and should work on the Mac.
Awesome!
-- brion