Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

26 Mar 2009


On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen nospam@vyznev.net wrote:
...
ERSEK Laszlo wrote:
...
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to section "Seeking in the dump file", bzip2recover to
split the bzip2 blocks out of the single bzip2 stream. The page states:
"This process is fast (since it involves almost no CPU calculations)"
While this may be true relative to other dump-processing operations,
bzip2recover is, in fact, not much more than a huge single-threaded
bit-shifter, which even makes two passes over the dump. (IIRC, the first
pass shifts over the whole dump to find bzip2 block delimiters, then the
second pass shifts the blocks found previously into byte-aligned, separate
bzip2 streams.)
Hmm?  Admittedly, I don't know the bzip2 format very well, but as far as
I understand it, there should be no bit-shifting involved: each block in
the stream is a completely independent, self-contained sequence of bytes.
I believe the point is that each block is a self-contained sequence of
bits not bytes, so a block can terminate in the middle of a byte.  The
next block is appended immediately (if I understand correctly), so
block boundaries do not necessarily align to byte boundaries.  Hence
the need to do bit shifting.
-Robert Rohde
