[Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Lev Muchnik levmuchnik at gmail.com
Tue Mar 16 20:54:26 UTC 2010


I am entirely for 7z. In fact, once it's released, I'll be able to test the
XML integrity right away - I process the data on the fly, without unpacking
it first.
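For concreteness, here is a minimal sketch of that kind of on-the-fly
processing (Python; the file name and the choice of ElementTree's iterparse
are illustrative assumptions, not a description of the actual pipeline):

    import bz2
    import xml.etree.ElementTree as ET

    def iter_titles(path):
        # Stream-decompress the dump and parse it incrementally, so the
        # multi-terabyte XML never has to be unpacked to disk.
        with bz2.open(path, "rb") as stream:
            for _, elem in ET.iterparse(stream, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]  # strip XML namespace
                if tag == "title":
                    yield elem.text
                elif tag == "page":
                    elem.clear()  # free the finished page's revisions

    # Illustrative file name -- substitute the real dump path.
    for title in iter_titles("enwiki-pages-meta-history.xml.bz2"):
        print(title)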


On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc at wikimedia.org> wrote:

> Kevin Webb wrote:
> > I just managed to finish decompression. That took about 54 hours on
> > an EC2 instance rated at 2.5 compute units. The final data size is
> > 5469 GB.
> >
> > As the process just finished, I haven't yet been able to check the
> > integrity of the XML; however, the bzip2 stream itself appears to
> > be good.
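Checking the compressed stream end to end can likewise be done without
extracting to disk. A sketch, equivalent in spirit to `bzip2 -t` (the
path is an assumption):

    import bz2

    def bzip2_stream_ok(path, chunk=1 << 20):
        # Read the entire file through the decompressor, discarding the
        # output; corruption or truncation raises an exception.
        try:
            with bz2.open(path, "rb") as stream:
                while stream.read(chunk):
                    pass
            return True
        except (OSError, EOFError) as exc:
            print("bad stream:", exc)
            return False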
> >
> > As was mentioned previously, it would be great if you could compress
> > future archives using pbzip2 to allow for parallel decompression. As
> > I understand it, pbzip2 files are backward compatible with all
> > existing bzip2 utilities.
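That compatibility holds because pbzip2 writes a series of independent
bz2 streams laid end to end, which stock bzip2 tools read one stream at
a time. A sketch of both paths (assumes pbzip2 is installed; the file
name is illustrative):

    import subprocess

    DUMP = "enwiki-pages-meta-history.xml.bz2"  # illustrative name

    # Parallel decompression with pbzip2 (-d decompress, -k keep the
    # input, -p8 use eight processors):
    subprocess.run(["pbzip2", "-d", "-k", "-p8", DUMP], check=True)

    # The very same file also decompresses with plain bzip2, just on a
    # single core:
    # subprocess.run(["bzip2", "-d", "-k", DUMP], check=True)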
>
> Looks like the trade-off is slightly larger files due to pbzip2's
> chunked compression algorithm. We'd have to change the buildFilters
> function in http://tinyurl.com/yjun6n5 and install the new binary.
> Ubuntu already ships it in 8.04 LTS, making that easy.
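The actual buildFilters code is behind the link above; purely as a
hypothetical illustration, the change amounts to swapping the compressor
command while leaving the output format alone:

    # Hypothetical sketch only -- the real buildFilters lives in the
    # dump scripts linked above and is surely shaped differently. The
    # point is just that the compressor swaps in place, since pbzip2
    # output stays bzip2-compatible.
    def buildFilters(parallel=True):
        compressor = "pbzip2 -p8" if parallel else "bzip2"
        return [compressor]  # shell filter(s) the dump is piped through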
>
> Any takers for the change?
>
> I'd also like to gauge everyone's opinion on moving away from the
> large bz2 files and going exclusively to 7z. We'd save a huge amount
> of space at a slightly higher cost during compression. Decompression
> of 7z these days is wicked fast.
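A move to 7z also need not break the streamed processing described at
the top of the thread: p7zip can extract to stdout, so the XML can still
be parsed without landing on disk. A sketch (file name illustrative):

    import subprocess
    import xml.etree.ElementTree as ET

    # `7z x -so` extracts the archive contents to stdout (needs p7zip).
    proc = subprocess.Popen(
        ["7z", "x", "-so", "enwiki-pages-meta-history.xml.7z"],
        stdout=subprocess.PIPE,
    )
    for _, elem in ET.iterparse(proc.stdout, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "page":
            elem.clear()  # handle the page here, then release it
    proc.stdout.close()
    proc.wait()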
>
> Let me know.
>
> --tomasz
>
> _______________________________________________
> Xmldatadumps-admin-l mailing list
> Xmldatadumps-admin-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
>