[Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Tomasz Finc tfinc at wikimedia.org
Tue Mar 16 20:45:24 UTC 2010


Kevin Webb wrote:
> I just managed to finish decompression. That took about 54 hours on an
> EC2 2.5x unit CPU. The final data size is 5469GB.
> 
> As the process just finished I haven't been able to check the
> integrity of the XML, however, the bzip stream itself appears to be
> good.
> 
> As was mentioned previously, it would be great if you could compress
> future archives using pbzip2 to allow for parallel decompression. As I
> understand it, the pbzip2 files are backward compatible with all
> existing bzip2 utilities.

Looks like the trade-off is slightly larger files, due to pbzip2 
compressing in independent chunks. We'd have to change the 
buildFilters function in http://tinyurl.com/yjun6n5 and install the new 
binary. Ubuntu has shipped pbzip2 since 8.04 LTS, which makes that part easy.
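I haven't dug into the script yet, but I'd guess the change is roughly 
along these lines (the function shape and names below are my guesses for 
illustration, not the actual worker code):

    # Sketch only: the real buildFilters lives in the dump worker script;
    # the structure here is guessed, not copied from it.
    def buildFilters(parallel=True):
        # pbzip2 writes concatenated bz2 streams, which stock bunzip2
        # still decompresses, so downstream consumers are unaffected.
        compressor = "pbzip2" if parallel else "bzip2"
        return "| %s -c" % compressor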

Any takers for the change?

I'd also like to gauge everyone's opinion on moving away from the large 
bz2 files and going exclusively to 7z. We'd save a huge amount of 
space at a slightly higher cost during compression, and decompression 
of 7z these days is wicked fast.
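For anyone worried about consuming 7z programmatically: p7zip's 
"7za e -so" extracts to stdout, so you can stream the XML straight into 
a parser without ever landing the decompressed file on disk. A quick 
sketch (the file name is just an example):

    import subprocess

    # Stream-decompress a 7z dump; "7za e -so" (from p7zip) writes the
    # extracted data to stdout instead of a file.
    proc = subprocess.Popen(
        ["7za", "e", "-so", "enwiki-pages-meta-history.xml.7z"],
        stdout=subprocess.PIPE,
    )
    for line in proc.stdout:
        pass  # hand each XML line to your parser here
    proc.wait()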

Let me know.

--tomasz