Not to mention that, for some of us outside the USA (even with a good connection to the EU
research network), the download takes, so to speak, slightly more time than expected
(and becomes more prone to errors as the file gets larger).
So another +1 for replacing bzip with 7zip.
F.
--- On Tue, 16/3/10, Kevin Webb <kpwebb(a)gmail.com> wrote:
From: Kevin Webb <kpwebb(a)gmail.com>
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming
pages-meta-history.xml.bz2 :D
To: "Lev Muchnik" <levmuchnik(a)gmail.com>
CC: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>,
xmldatadumps-admin-l(a)lists.wikimedia.org, Xmldatadumps-l(a)lists.wikimedia.org
Date: Tuesday, 16 March 2010 22:35
Yeah, same here. I'm totally fine with replacing bzip with 7zip as the
primary format for the dumps. Seems like it solves the space and speed
problems together...
I just did a quick benchmark and got a 7x improvement on decompression
speed using 7zip over bzip using a single core, based on actual dump
data.
kpw
On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik <levmuchnik(a)gmail.com> wrote:
I am entirely for 7z. In fact, once released, I'll be able to test the XML
integrity right away - I process the data on the fly, without unpacking it
first.
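
For readers who want to try the "on the fly" approach, here is a minimal
sketch, not Lev's actual tooling. It assumes Python and uses the standard
library's bz2 module as a stand-in, since the standard library has no 7z
reader. The function names count_pages and count_pages_in_bz2 are
hypothetical. The parser reads from the decompressed stream incrementally, so
the full XML is never written to disk, and a malformed document raises
xml.etree.ElementTree.ParseError, which doubles as a basic integrity check.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def count_pages(stream):
    """Count <page> elements while reading a file-like object incrementally."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Drop any XML namespace prefix of the form "{uri}tag".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "page":
            count += 1
            elem.clear()  # release the element's children to keep memory flat
    return count


def count_pages_in_bz2(path):
    """Stream-decompress a .bz2 dump and count its pages without unpacking it."""
    with bz2.open(path, "rb") as stream:
        return count_pages(stream)


# Tiny in-memory demo so the sketch runs without a real dump file.
sample = (
    b"<mediawiki>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
with bz2.open(io.BytesIO(bz2.compress(sample)), "rb") as stream:
    demo_count = count_pages(stream)
print(demo_count)
```

The same count_pages function should work on any binary file-like object, for
instance the output pipe of an external decompressor.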
On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc(a)wikimedia.org> wrote:
>
> Kevin Webb wrote:
> > I just managed to finish decompression. That took about 54 hours on an
> > EC2 2.5x unit CPU. The final data size is 5469GB.
> >
> > As the process just finished I haven't been able to check the
> > integrity of the XML, however, the bzip stream itself appears to be
> > good.
> >
> > As was mentioned previously, it would be great if you could compress
> > future archives using pbzip2 to allow for parallel decompression. As I
> > understand it, the pbzip2 files are backward compatible with all
> > existing bzip2 utilities.
>
> Looks like the trade-off is slightly larger files due to pbzip2's
> algorithm for individual chunking. We'd have to change the
> buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> binary. Ubuntu already has it in 8.04 LTS, making it easy.
>
> Any takers for the change?
>
> I'd also like to gauge everyone's opinion on moving away from the large
> file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> space doing it at a slightly larger cost during compression.
> Decompression of 7z these days is wicked fast.
>
> Let me know
>
> --tomasz
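
To illustrate the chunking trade-off mentioned above, here is a rough sketch
in Python. It is not the dump scripts' code and not pbzip2 itself. The helper
compress_in_chunks and the 100,000-byte chunk size are assumptions made up for
the example. The idea is that each chunk is compressed as its own bzip2 stream
and the streams are concatenated. A standard bzip2 decoder can still read the
result, and the independent streams are what make parallel decompression
possible, at the price of some per-chunk overhead.

```python
import bz2


def compress_in_chunks(data, chunk_size):
    """Compress each chunk as an independent bzip2 stream and concatenate them."""
    return b"".join(
        bz2.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    )


# Synthetic, repetitive input standing in for dump XML (not real dump data).
data = b"<page><title>Example</title><text>some revision text</text></page>\n" * 20000

single = bz2.compress(data)                              # one stream for the whole input
chunked = compress_in_chunks(data, chunk_size=100_000)   # many independent streams

# The stock decoder reads the concatenated streams back as one payload.
restored = bz2.decompress(chunked)

print("single stream bytes:", len(single))
print("chunked stream bytes:", len(chunked))
print("round trip ok:", restored == data)
```

Comparing the two printed sizes on a sample of real dump data would give a
better feel for how much larger the chunked output gets in practice.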
_______________________________________________
Xmldatadumps-admin-l mailing list
Xmldatadumps-admin-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l