Not to mention that, for some of us outside the USA (even with a good connection to the EU research network), the download takes, so to speak, slightly more time than expected (and becomes more error-prone as the file gets larger).
So another +1 for replacing bzip with 7zip.
F.
--- On Tue, 16/3/10, Kevin Webb kpwebb@gmail.com wrote:
From: Kevin Webb kpwebb@gmail.com
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: "Lev Muchnik" levmuchnik@gmail.com
CC: "Wikimedia developers" wikitech-l@lists.wikimedia.org, xmldatadumps-admin-l@lists.wikimedia.org, Xmldatadumps-l@lists.wikimedia.org
Date: Tuesday, 16 March 2010, 22:35
Yeah, same here. I'm totally fine with replacing bzip with 7zip as the primary format for the dumps. Seems like it solves the space and speed problems together...
I just did a quick benchmark and got a 7x improvement on decompression speed using 7zip over bzip using a single core, based on actual dump data.
kpw
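For anyone who wants to repeat a single-core comparison like the one above, here is a rough sketch. It is not the benchmark that was actually run: it uses Python's standard bz2 and lzma modules (lzma writes .xz containers rather than .7z archives, though both belong to the LZMA family), and the sample data is a made-up stand-in for real dump content, so the ratio it reports will differ.

```python
import bz2
import lzma
import time


def time_decompression(raw: bytes, repeats: int = 3) -> dict:
    """Return best-of-N single-threaded decompression times in seconds."""
    blobs = {
        "bz2": (bz2.compress(raw), bz2.decompress),
        "lzma": (lzma.compress(raw), lzma.decompress),
    }
    results = {}
    for name, (blob, decompress) in blobs.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            out = decompress(blob)
            best = min(best, time.perf_counter() - start)
        assert out == raw  # the round trip must be lossless
        results[name] = {"compressed_bytes": len(blob), "seconds": best}
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive XML-like text standing in for dump data.
    sample = b"<page><title>Example</title><text>lorem ipsum</text></page>\n" * 20_000
    for name, stats in time_decompression(sample).items():
        print(name, stats)
```

Timings on synthetic, highly repetitive input say little about real dumps, so feed it a slice of an actual dump before drawing conclusions.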
On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik levmuchnik@gmail.com wrote:
I am entirely for 7z. In fact, once released, I'll be able to test the XML integrity right away - I process the data on the fly, without unpacking it first.
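Checking XML integrity on the fly, without unpacking first, could look roughly like the sketch below. It is an illustration only, not the tooling used by anyone in this thread: it reads a .bz2 file through Python's standard library, and the function name count_pages and the choice to count "page" elements are assumptions made for the example.

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages(path: str) -> int:
    """Stream-parse a .bz2 XML file and count its <page> elements.

    Raises ET.ParseError if the XML is malformed; a truncated bzip2
    stream surfaces as EOFError from the decompressor.
    """
    pages = 0
    with bz2.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Dump tags may carry an XML namespace, so compare the local name only.
            if elem.tag.rsplit("}", 1)[-1] == "page":
                pages += 1
                elem.clear()  # drop the finished page's contents to limit memory growth
    return pages
```

Because the parser pulls decompressed bytes as it goes, the uncompressed XML never has to be written to disk.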
On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc tfinc@wikimedia.org wrote:
Kevin Webb wrote:
I just managed to finish decompression. That took about 54 hours on an EC2 2.5x unit CPU. The final data size is 5469GB.
As the process just finished I haven't been able to check the integrity of the XML; however, the bzip stream itself appears to be good.
As was mentioned previously, it would be great if you could compress future archives using pbzip2 to allow for parallel decompression. As I understand it, the pbzip2 files are backward compatible with all existing bzip2 utilities.
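The compatibility claim can be illustrated with a small sketch. It assumes, as I understand pbzip2's approach, that the output is a series of independently compressed bzip2 streams written back to back; the helper compress_in_chunks and the sample data are made up for the example and are not pbzip2 itself.

```python
import bz2


def compress_in_chunks(raw: bytes, chunk_size: int) -> bytes:
    """Compress each chunk as an independent bzip2 stream and concatenate them."""
    return b"".join(
        bz2.compress(raw[i:i + chunk_size])
        for i in range(0, len(raw), chunk_size)
    )


data = b"<revision>some text</revision>\n" * 50_000
multi = compress_in_chunks(data, chunk_size=100_000)
single = bz2.compress(data)

# A standard decoder reads the concatenated streams as one file.
assert bz2.decompress(multi) == data
print(len(single), len(multi))  # compare single-stream and chunked sizes
```

Each chunk can be compressed or decompressed on its own core, which is where the parallel speedup would come from; the price is that every chunk starts its own stream from scratch.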
Looks like the trade-off is slightly larger files due to pbzip2's algorithm for individual chunking. We'd have to change the buildFilters function in http://tinyurl.com/yjun6n5 and install the new binary. Ubuntu already has it in 8.04 LTS, making it easy.
Any takers for the change?
I'd also like to gauge everyone's opinion on moving away from the large file sizes of bz2 and going exclusively 7z. We'd save a huge amount of space doing it at a slightly larger cost during compression. Decompression of 7z these days is wicked fast.
Let me know
--tomasz
Xmldatadumps-admin-l mailing list Xmldatadumps-admin-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l