----- Original Message -----
From: Lev Muchnik <levmuchnik@gmail.com>
Date: Tuesday, March 16, 2010 1:55 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Tomasz Finc <tfinc@wikimedia.org>
Cc: Wikimedia developers <wikitech-l@lists.wikimedia.org>,
    xmldatadumps-admin-l@lists.wikimedia.org,
    Xmldatadumps-l@lists.wikimedia.org
> I am entirely for 7z. In fact, once released, I'll be able to test the XML
> integrity right away - I process the data on the fly, without unpacking it
> first.
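
A minimal sketch of what processing a dump "on the fly, without unpacking it first" could look like. It assumes a bzip2-compressed XML file and uses only the Python standard library; the function name count_pages and the choice of counting <page> elements are illustrative and not taken from the thread. A 7z archive would need a different reader, since the bz2 module does not handle that format.

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages(path):
    """Stream a .bz2 XML dump and count <page> elements without unpacking to disk.

    ET.iterparse raises ET.ParseError on malformed XML, so a clean run also
    acts as a basic well-formedness check.
    """
    pages = 0
    with bz2.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Drop any XML namespace prefix: "{ns}page" -> "page"
            if elem.tag.rsplit("}", 1)[-1] == "page":
                pages += 1
                # Discard the finished page's children to limit memory use
                elem.clear()
    return pages
```

Calling count_pages("dump.xml.bz2") (a hypothetical filename) decompresses and parses in a single pass, so the uncompressed XML never has to exist on disk.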
>
>
> On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc@wikimedia.org> wrote:
>
> > Kevin Webb wrote:
> > > I just managed to finish decompression. That took about 54 hours on an
> > > EC2 2.5x unit CPU. The final data size is 5469GB.
> > >
> > > As the process just finished I haven't been able to check the
> > > integrity of the XML, however, the bzip stream itself appears to be
> > > good.
> > >
> > > As was mentioned previously, it would be great if you could compress
> > > future archives using pbzip2 to allow for parallel decompression. As I
> > > understand it, the pbzip2 files are backward compatible with all
> > > existing bzip2 utilities.
> >
> > Looks like the trade-off is slightly larger files due to pbzip2's
> > algorithm for individual chunking. We'd have to change the
> > buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> > binary. Ubuntu already has it in 8.04 LTS, making it easy.
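
The chunking trade-off discussed above can be sketched in a few lines. This is an illustration of the general idea only (compress each chunk as an independent bzip2 stream, then concatenate the streams), assuming that is roughly how pbzip2 lays out its output; it is not pbzip2's actual code, and the function name compress_in_chunks and its parameters are made up for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_in_chunks(data, chunk_size=900_000, workers=4):
    """Compress each chunk as its own bzip2 stream and concatenate the streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the streams line up with the chunks
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)
```

Python's bz2.decompress accepts multi-stream input, so bz2.decompress(compress_in_chunks(data)) returns the original bytes, which mirrors the compatibility point made in the thread. Each stream carries its own header and coding tables, which is one reason the chunked output tends to be somewhat larger than a single-stream file.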
> >
> > Any takers for the change?
> >
> > I'd also like to gauge everyone's opinion on moving away from the large
> > file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> > space doing it at a slightly larger cost during compression.
> > Decompression of 7z these days is wicked fast.
> >
> > Let me know.
> >
> > --tomasz
> >
> > _______________________________________________
> > Xmldatadumps-admin-l mailing list
> > Xmldatadumps-admin-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
> >
>