[Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Felipe Ortega glimmer_phoenix at yahoo.es
Wed Mar 17 23:53:12 UTC 2010


On top of that, for some of us outside the USA (even with a good connection to the EU research network), the download process takes, so to say, slightly more time than expected (and is prone to errors as the file gets larger).
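
Since the subject of this thread is checksumming anyway: for transfers
this long it is worth verifying the finished file against the MD5
published alongside the dump before doing anything else with it. A
minimal sketch (the expected value below is a placeholder, not a real
checksum):

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        # Hash the file in 1 MB chunks so memory use stays flat.
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    expected = '0123456789abcdef0123456789abcdef'  # placeholder value
    if md5_of('pages-meta-history.xml.bz2') != expected:
        print('checksum mismatch - download again')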

So another +1 for replacing bzip2 with 7zip.

F. 

--- On Tue, 3/16/10, Kevin Webb <kpwebb at gmail.com> wrote:

> From: Kevin Webb <kpwebb at gmail.com>
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
> To: "Lev Muchnik" <levmuchnik at gmail.com>
> Cc: "Wikimedia developers" <wikitech-l at lists.wikimedia.org>, xmldatadumps-admin-l at lists.wikimedia.org, Xmldatadumps-l at lists.wikimedia.org
> Date: Tuesday, 16 March 2010, 22:35
> Yeah, same here. I'm totally fine with replacing bzip with 7zip as the
> primary format for the dumps. Seems like it solves the space and speed
> problems together...
> 
> I just did a quick benchmark and got a 7x improvement on decompression
> speed using 7zip over bzip on a single core, based on actual dump data.
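> 
> If anyone wants to reproduce that, something along these lines (a
> sketch, not my exact harness; the file names are placeholders and
> bzip2/7z must be on the PATH):
> 
>     import os, subprocess, time
> 
>     def time_decompress(cmd):
>         # Run a decompression command, discard output, return seconds.
>         start = time.time()
>         with open(os.devnull, 'wb') as sink:
>             subprocess.check_call(cmd, stdout=sink)
>         return time.time() - start
> 
>     t_bz = time_decompress(['bzip2', '-dc', 'sample.xml.bz2'])
>     t_7z = time_decompress(['7z', 'e', '-so', 'sample.xml.7z'])
>     print('bzip2 %.0fs, 7z %.0fs, %.1fx faster' % (t_bz, t_7z, t_bz / t_7z))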
> 
> kpw
> 
> 
> 
> On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik <levmuchnik at gmail.com> wrote:
> >
> > I am entirely for 7z. In fact, once released, I'll be able to test the
> > XML integrity right away - I process the data on the fly, without
> > unpacking it first.
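> >
> > To illustrate what I mean by processing on the fly, a rough sketch
> > (the file name is a placeholder and 7z must be on the PATH): '7z e -so'
> > streams the extracted XML to stdout, and a pull parser consumes it
> > without anything ever landing on disk.
> >
> >     import subprocess
> >     import xml.etree.ElementTree as ET
> >
> >     proc = subprocess.Popen(['7z', 'e', '-so', 'pages-meta-history.xml.7z'],
> >                             stdout=subprocess.PIPE)
> >     context = ET.iterparse(proc.stdout, events=('start', 'end'))
> >     _, root = next(context)  # keep the root so finished pages can be freed
> >     for event, elem in context:
> >         if event == 'end' and elem.tag.endswith('page'):
> >             # ... process one <page> element here ...
> >             root.clear()  # drop handled pages to keep memory flat
> >     proc.stdout.close()
> >     proc.wait()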
> >
> >
> > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc at wikimedia.org> wrote:
> >>
> >> Kevin Webb wrote:
> >> > I just managed to finish decompression. That took about 54 hours on
> >> > an EC2 2.5x unit CPU. The final data size is 5469GB.
> >> >
> >> > As the process just finished, I haven't been able to check the
> >> > integrity of the XML; however, the bzip stream itself appears to be
> >> > good.
> >> >
> >> > As was mentioned previously, it would be great if you could compress
> >> > future archives using pbzip2 to allow for parallel decompression. As
> >> > I understand it, pbzip2 files are backward compatible with all
> >> > existing bzip2 utilities.
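> >> >
> >> > For example (a sketch; the file name is a placeholder, and -p picks
> >> > the core count):
> >> >
> >> >     import subprocess
> >> >     # pbzip2 output is an ordinary .bz2 stream, so plain bzip2 can
> >> >     # still read it; pbzip2 itself splits the work across cores.
> >> >     subprocess.check_call(['pbzip2', '-d', '-p8',
> >> >                            'pages-meta-history.xml.bz2'])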
> >>
> >> Looks like the trade-off is slightly larger files due to pbzip2's
> >> algorithm for individual chunking. We'd have to change the
> >> buildFilters function in http://tinyurl.com/yjun6n5 and install the
> >> new binary. Ubuntu already has it in 8.04 LTS, making it easy.
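> >>
> >> Shape-wise, something like this is all I have in mind (a hypothetical
> >> sketch, not the actual buildFilters code):
> >>
> >>     # Hypothetical sketch only: prefer pbzip2 for the bz2 filter
> >>     # when the binary is installed, else fall back to bzip2.
> >>     import shutil
> >>
> >>     def bzip2_filter():
> >>         tool = 'pbzip2' if shutil.which('pbzip2') else 'bzip2'
> >>         return '| %s ' % tool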
> >>
> >> Any takers for the change?
> >>
> >> I'd also like to gauge everyone's opinion on moving away from the
> >> large bz2 files and going exclusively to 7z. We'd save a huge amount
> >> of space, at a slightly higher cost during compression. Decompression
> >> of 7z these days is wicked fast.
> >>
> >> Let me know.
> >>
> >> --tomasz