ERSEK Laszlo wrote:
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to section "Seeking in the dump file", bzip2recover to split the bzip2 blocks out of the single bzip2 stream. The page states
This process is fast (since it involves almost no CPU calculations
While this may be true relative to other dump-processing operations, bzip2recover is, in fact, not much more than a huge single threaded bit-shifter, which even makes two passes over the dump. (IIRC, the first pass shifts over the whole dump to find bzip2 block delimiteres, then the second pass shifts the blocks found previously into byte-aligned, separate bzip2 streams.)
Hmm? Admittedly, I don't know the bzip2 format very well, but as far as I understand it, there should be no bit-shifting involved: each block in the stream is a completely independent, self-contained sequence of bytes.