Tomasz Finc wrote:
> New full history en wiki snapshot is hot off the presses!
>
> It's currently being checksummed which will take a while for 280GB+ of
> compressed data but for those brave souls willing to test please grab it
> from
>
> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-hi…
>
>
> and give us feedback about its quality. This run took just over a month
> and gained a huge speed-up after Tim's work on re-compressing ES. If we
> see no hiccups with this data snapshot, I'll start mirroring it to other
> locations (internet archive, amazon public data sets, etc).
>
> For those not familiar, the last successful run that we've seen of this
> data goes all the way back to 2008-10-03. That's over 1.5 years of
> people waiting to get access to these data bits.
>
> I'm excited to say that we seem to have it :)
>
> --tomasz
We now have an md5sum for enwiki-20100130-pages-meta-history.xml.bz2.
"65677bc275442c7579857cc26b355ded"
Please verify against it before filing issues.
--tomasz
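For anyone who would rather script the comparison than eyeball md5sum's output: hashing in fixed-size chunks keeps memory flat even on a 280GB+ file. Below is a rough, illustrative C++ sketch using OpenSSL's EVP interface (an assumption on my part that OpenSSL 1.1+ is available; the plain md5sum command-line tool does exactly the same job). Build with something like 'g++ md5check.cpp -lcrypto'.

#include <cstdio>
#include <openssl/evp.h>

// Chunked MD5 of a (possibly huge) file, printed as a lowercase hex string.
int main(int argc, char** argv) {
    if (argc != 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }
    FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 2; }

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_md5(), nullptr);

    static unsigned char buf[1 << 20];          // 1 MiB per read
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);

    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, md, &len);
    EVP_MD_CTX_free(ctx);
    std::fclose(f);

    for (unsigned int i = 0; i < len; ++i) std::printf("%02x", md[i]);
    std::printf("\n");
    return 0;
}

Compare the printed hex string against "65677bc275442c7579857cc26b355ded" before filing issues.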
The LZMA SDK <http://www.7-zip.org/sdk.html> provides a C-style API. The
only problem I find is that it requires polling - recurrent calls to extract
pieces of data. So I wrapped it with a C++ stream which I feed to the
xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing
(3 days to process all languages except English).
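For anyone who wants to try the same kind of streaming pass, here is a rough sketch; it is not my actual wrapper. Instead of the LZMA-backed C++ stream it simply reads already-decompressed XML from stdin (e.g. piped from '7za e -so dump.7z' or 'bzcat dump.bz2'), and the <page>/<revision> counting is only there to give the SAX handler something to do. Build with something like 'g++ count_pages.cpp -lxerces-c'.

#include <cstdio>
#include <cstring>
#include <xercesc/framework/StdInInputSource.hpp>
#include <xercesc/sax2/Attributes.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>

XERCES_CPP_NAMESPACE_USE

// SAX handler that counts <page> and <revision> elements as they stream past.
class DumpCounter : public DefaultHandler {
public:
    unsigned long pages = 0, revisions = 0;

    void startElement(const XMLCh* const, const XMLCh* const localname,
                      const XMLCh* const, const Attributes&) override {
        char* name = XMLString::transcode(localname);
        if (std::strcmp(name, "page") == 0) ++pages;
        else if (std::strcmp(name, "revision") == 0) ++revisions;
        XMLString::release(&name);
    }
};

int main() {
    XMLPlatformUtils::Initialize();
    {
        SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();
        DumpCounter counter;
        parser->setContentHandler(&counter);
        parser->setErrorHandler(&counter);

        StdInInputSource input;        // decompressed XML arrives on stdin
        parser->parse(input);          // single pass, constant memory

        std::printf("pages=%lu revisions=%lu\n", counter.pages, counter.revisions);
        delete parser;
    }
    XMLPlatformUtils::Terminate();
    return 0;
}

The real setup just swaps StdInInputSource for an InputSource whose BinInputStream pulls decompressed bytes straight out of the LZMA SDK decoder.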
On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken <jmorken(a)shaw.ca> wrote:
>
> Hi,
>
> Is this code available to process the 7zip data on the fly? I had heard a
> rumour before that 7zip required multiple passes to decompress.
>
> cheers,
> Jamie
>
>
>
> ----- Original Message -----
> From: Lev Muchnik <levmuchnik(a)gmail.com>
> Date: Tuesday, March 16, 2010 1:55 pm
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
> Checksumming pages-meta-history.xml.bz2 :D
> To: Tomasz Finc <tfinc(a)wikimedia.org>
> Cc: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>,
> xmldatadumps-admin-l(a)lists.wikimedia.org,
> Xmldatadumps-l(a)lists.wikimedia.org
>
> > I am entirely for 7z. In fact, once released, I'll be able to
> > test the XML
> > integrity right away - I process the data on the fly,
> > without unpacking it
> > first.
> >
> >
> > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc
> > <tfinc(a)wikimedia.org> wrote:
> >
> > > Kevin Webb wrote:
> > > > I just managed to finish decompression. That took about 54
> > hours on an
> > > > EC2 2.5x unit CPU. The final data size is 5469GB.
> > > >
> > > > As the process just finished I haven't been able to check the
> > > > integrity of the XML, however, the bzip stream itself
> > appears to be
> > > > good.
> > > >
> > > > As was mentioned previously, it would be great if you could compress
> > > > future archives using pbzip2 to allow for parallel decompression. As I
> > > > understand it, the pbzip2 files are backward compatible with all
> > > > existing bzip2 utilities.
> > >
> > > Looks like the trade-off is slightly larger files due to pbzip2's
> > > algorithm for individual chunking. We'd have to change the
> > >
> > > buildFilters function in http://tinyurl.com/yjun6n5 and
> > install the new
> > > binary. Ubuntu already has it in 8.04 LTS making it easy.
> > >
> > > Any takers for the change?
> > >
> > > I'd also like to gauge everyone's opinion on moving away from
> > the large
> > > file sizes of bz2 and going exclusively 7z. We'd save a huge
> > amount of
> > > space doing it at a slightly larger cost during compression.
> > > Decompression of 7z these days is wicked fast.
> > >
> > > let us know
> > >
> > > --tomasz
On top of that, for some of us outside the USA (even with a good connection
to the EU research network) the download process takes, so to say, rather
longer than expected, and is prone to errors as the file gets larger.
So another +1 for replacing bzip2 with 7zip.
F.
--- On Tue, 3/16/10, Kevin Webb <kpwebb(a)gmail.com> wrote:
> From: Kevin Webb <kpwebb(a)gmail.com>
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
> To: "Lev Muchnik" <levmuchnik(a)gmail.com>
> CC: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>, xmldatadumps-admin-l(a)lists.wikimedia.org, Xmldatadumps-l(a)lists.wikimedia.org
> Date: Tuesday, 16 March 2010, 22:35
> Yeah, same here. I'm totally fine
> with replacing bzip with 7zip as the
> primary format for the dumps. Seems like it solves the
> space and speed
> problems together...
>
> I just did a quick benchmark and got a 7x improvement on
> decompression
> speed using 7zip over bzip using a single core, based on
> actual dump
> data.
>
> kpw
>
>
>
> On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik <levmuchnik(a)gmail.com>
> wrote:
> >
> > I am entirely for 7z. In fact, once released, I'll be
> able to test the XML
> > integrity right away - I process the data on the fly,
> without unpacking it
> > first.
> >
> >
> > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc(a)wikimedia.org>
> wrote:
> >>
> >> Kevin Webb wrote:
> >> > I just managed to finish decompression. That
> took about 54 hours on an
> >> > EC2 2.5x unit CPU. The final data size is
> 5469GB.
> >> >
> >> > As the process just finished I haven't been
> able to check the
> >> > integrity of the XML, however, the bzip
> stream itself appears to be
> >> > good.
> >> >
> >> > As was mentioned previously, it would be great if you could compress
> >> > future archives using pbzip2 to allow for parallel decompression. As I
> >> > understand it, the pbzip2 files are backward compatible with all
> >> > existing bzip2 utilities.
> >>
> >> Looks like the trade-off is slightly larger files
> due to pbzip2's
> >> algorithm for individual chunking. We'd have to
> change the
> >>
> >> buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> >> binary. Ubuntu already has it in 8.04 LTS making
> it easy.
> >>
> >> Any takers for the change?
> >>
> >> I'd also like to gauge everyone's opinion on moving
> away from the large
> >> file sizes of bz2 and going exclusively 7z. We'd
> save a huge amount of
> >> space doing it at a slightly larger cost during
> compression.
> >> Decompression of 7z these days is wicked fast.
> >>
> >> let us know
> >>
> >> --tomasz
--- On Tue, 3/16/10, Kevin Webb <kpwebb(a)gmail.com> wrote:
> From: Kevin Webb <kpwebb(a)gmail.com>
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
> To: "Tomasz Finc" <tfinc(a)wikimedia.org>
> CC: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>, xmldatadumps-admin-l(a)lists.wikimedia.org, Xmldatadumps-l(a)lists.wikimedia.org
> Date: Tuesday, 16 March 2010, 21:10
> I just managed to finish
> decompression. That took about 54 hours on an
> EC2 2.5x unit CPU. The final data size is 5469GB.
>
> As the process just finished I haven't been able to check
> the
> integrity of the XML, however, the bzip stream itself
> appears to be
> good.
>
> As was mentioned previously, it would be great if you could compress
> future archives using pbzip2 to allow for parallel decompression. As I
> understand it, the pbzip2 files are backward compatible with all
> existing bzip2 utilities.
>
Yes, they are :-).
Regards,
F.
> Thanks again for all your work on this!
> Kevin
>
>
> On Tue, Mar 16, 2010 at 4:05 PM, Tomasz Finc <tfinc(a)wikimedia.org>
> wrote:
> > Tomasz Finc wrote:
> >> New full history en wiki snapshot is hot off the
> presses!
> >>
> >> It's currently being checksummed which will take a
> while for 280GB+ of
> >> compressed data but for those brave souls willing
> to test please grab it
> >> from
> >>
> >> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-hi…
> >>
> >> and give us feedback about its quality. This run
> took just over a month
> >> and gained a huge speed-up after Tim's work on
> re-compressing ES. If we
> >> see no hiccups with this data snapshot, I'll start
> mirroring it to other
> >> locations (internet archive, amazon public data
> sets, etc).
> >>
> >> For those not familiar, the last successful run
> that we've seen of this
> >> data goes all the way back to 2008-10-03. That's
> over 1.5 years of
> >> people waiting to get access to these data bits.
> >>
> >> I'm excited to say that we seem to have it :)
> >
> > So now that we've had it for a couple of days .. can I
> get a status
> > report from someone about its quality?
> >
> > Even if you had no issues please let us know so that
> we start mirroring.
> >
> > --tomasz
Kevin Webb wrote:
> I just managed to finish decompression. That took about 54 hours on an
> EC2 2.5x unit CPU. The final data size is 5469GB.
>
> As the process just finished I haven't been able to check the
> integrity of the XML, however, the bzip stream itself appears to be
> good.
>
> As was mentioned previously, it would be great if you could compress
> future archives using pbzip2 to allow for parallel decompression. As I
> understand it, the pbzip2 files are backward compatible with all
> existing bzip2 utilities.
Looks like the trade-off is slightly larger files due to pbzip2's
algorithm for individual chunking. We'd have to change the
buildFilters function in http://tinyurl.com/yjun6n5 and install the new
binary. Ubuntu already has it in 8.04 LTS, making it easy.
Any takers for the change?
I'd also like to gauge everyone's opinion on moving away from the large
file sizes of bz2 and going exclusively to 7z. We'd save a huge amount of
space at a slightly higher cost during compression.
Decompression of 7z these days is wicked fast.
Let us know.
--tomasz
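To spell out the "backward compatible" point above: as I understand it, pbzip2 writes its output as a series of independently compressed bzip2 streams concatenated together, and any reader that follows libbz2's documented multi-stream pattern (call BZ2_bzReadGetUnused after each BZ_STREAM_END, then reopen with the leftover bytes) handles pbzip2 and plain bzip2 files alike. A rough illustrative C++ sketch of that pattern, not code from the dump tooling (build with 'g++ bzcat_multi.cpp -lbz2'):

#include <bzlib.h>
#include <cstdio>
#include <vector>

// Decompress a .bz2 file that may contain several concatenated streams
// (as pbzip2 produces) and write the plain data to stdout.
int main(int argc, char** argv) {
    if (argc != 2) { std::fprintf(stderr, "usage: %s <file.bz2>\n", argv[0]); return 2; }
    FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 2; }

    std::vector<char> carry;                 // bytes read past the end of one stream
    char buf[1 << 16];
    int bzerr = BZ_OK;

    while (true) {
        BZFILE* bz = BZ2_bzReadOpen(&bzerr, f, 0, 0,
                                    carry.empty() ? nullptr : carry.data(),
                                    static_cast<int>(carry.size()));
        if (bzerr != BZ_OK) { std::fprintf(stderr, "bzReadOpen failed\n"); return 1; }

        while (bzerr == BZ_OK) {
            int n = BZ2_bzRead(&bzerr, bz, buf, static_cast<int>(sizeof buf));
            if (bzerr == BZ_OK || bzerr == BZ_STREAM_END)
                std::fwrite(buf, 1, static_cast<size_t>(n), stdout);
        }
        if (bzerr != BZ_STREAM_END) { std::fprintf(stderr, "decompression error\n"); return 1; }

        // Bytes already consumed from the file but belonging to the *next*
        // stream must be carried over before closing this handle.
        void* unused = nullptr;
        int nUnused = 0;
        BZ2_bzReadGetUnused(&bzerr, bz, &unused, &nUnused);
        if (nUnused > 0)
            carry.assign(static_cast<char*>(unused), static_cast<char*>(unused) + nUnused);
        else
            carry.clear();
        BZ2_bzReadClose(&bzerr, bz);

        if (carry.empty()) {                 // peek: is there another stream?
            int c = std::fgetc(f);
            if (c == EOF) break;
            std::ungetc(c, f);
        }
    }
    std::fclose(f);
    return 0;
}

On an ordinary single-stream .bz2 the outer loop runs exactly once, which is why the two formats stay interchangeable.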
Hi Jamie,
Looks cool! Thanks for the link. It seems to serve a different purpose,
though. It looks like one can keep the data compressed and access it directly
in the archive. That was never my objective: the setup I described is optimized
for one pass through the data, which is perfect if you need to extract certain
elements and do not need repeated or random reads.
Lev
On Tue, Mar 16, 2010 at 7:13 PM, Jamie Morken <jmorken(a)shaw.ca> wrote:
> hi,
>
> I wonder how the zim file format: http://www.openzim.org/Main_Page
> would compare to the 7-zip file in regards to size and access speed?
>
>
> cheers,
> Jamie
>
>
> ----- Original Message -----
> From: Lev Muchnik <levmuchnik(a)gmail.com>
> Date: Tuesday, March 16, 2010 2:36 pm
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
> Checksumming pages-meta-history.xml.bz2 :D
> To: Jamie Morken <jmorken(a)shaw.ca>
> Cc: xmldatadumps-l(a)lists.wikimedia.org
>
> > The LZMA SDK <http://www.7-zip.org/sdk.html> provides a C-style API. The
> > only problem I find is that it requires polling - recurrent calls to extract
> > pieces of data. So I wrapped it with a C++ stream which I feed to the
> > xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing
> > (3 days to process all languages except English).
> >
> > On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken
> > <jmorken(a)shaw.ca> wrote:
> >
> > >
> > > Hi,
> > >
> > > Is this code available to process the 7zip data on the
> > fly? I had heard a
> > > rumour before that 7zip required multiple passes to decompress.
> > >
> > > cheers,
> > > Jamie
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: Lev Muchnik <levmuchnik(a)gmail.com>
> > > Date: Tuesday, March 16, 2010 1:55 pm
> > > Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
> > > Checksumming pages-meta-history.xml.bz2 :D
> > > To: Tomasz Finc <tfinc(a)wikimedia.org>
> > > Cc: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>,
> > > xmldatadumps-admin-l(a)lists.wikimedia.org,
> > > Xmldatadumps-l(a)lists.wikimedia.org
> > >
> > > > I am entirely for 7z. In fact, once released, I'll be able to
> > > > test the XML
> > > > integrity right away - I process the data on the fly,
> > > > without unpacking it
> > > > first.
> > > >
> > > >
> > > > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc
> > > > <tfinc(a)wikimedia.org> wrote:
> > > >
> > > > > Kevin Webb wrote:
> > > > > > I just managed to finish decompression. That took about 54
> > > > hours on an
> > > > > > EC2 2.5x unit CPU. The final data size is 5469GB.
> > > > > >
> > > > > > As the process just finished I haven't been able to
> > check the
> > > > > > integrity of the XML, however, the bzip stream itself
> > > > appears to be
> > > > > > good.
> > > > > >
> > > > > > As was mentioned previously, it would be great if you could compress
> > > > > > future archives using pbzip2 to allow for parallel decompression. As I
> > > > > > understand it, the pbzip2 files are backward compatible with all
> > > > > > existing bzip2 utilities.
> > > > >
> > > > > Looks like the trade-off is slightly larger files due to pbzip2's
> > > > > algorithm for individual chunking. We'd have to change the
> > > > >
> > > > > buildFilters function in http://tinyurl.com/yjun6n5 and
> > > > install the new
> > > > > binary. Ubuntu already has it in 8.04 LTS making it easy.
> > > > >
> > > > > Any takers for the change?
> > > > >
> > > > > I'd also like to gauge everyone's opinion on moving away from
> > > > the large
> > > > > file sizes of bz2 and going exclusively 7z. We'd save a huge
> > > > amount of
> > > > > space doing it at a slightly larger cost during compression.
> > > > > Decompression of 7z these days is wicked fast.
> > > > >
> > > > > let us know
> > > > >
> > > > > --tomasz
Tomasz Finc wrote:
> New full history en wiki snapshot is hot off the presses!
>
> It's currently being checksummed which will take a while for 280GB+ of
> compressed data but for those brave souls willing to test please grab it
> from
>
> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-hi…
>
> and give us feedback about its quality. This run took just over a month
> and gained a huge speed-up after Tim's work on re-compressing ES. If we
> see no hiccups with this data snapshot, I'll start mirroring it to other
> locations (internet archive, amazon public data sets, etc).
>
> For those not familiar, the last successful run that we've seen of this
> data goes all the way back to 2008-10-03. That's over 1.5 years of
> people waiting to get access to these data bits.
>
> I'm excited to say that we seem to have it :)
So now that we've had it for a couple of days, can I get a status
report from someone about its quality?
Even if you had no issues, please let us know so that we can start mirroring.
--tomasz
Thankfully, due to an awesome volunteer, we'll be able to get that 2008
snapshot into our archive. I'll mail out when it shows up in our snail mail.
--tomasz
Erik Zachte wrote:
> I'm thrilled. Big thanks to Tim and Tomasz for pulling this off.
> For the record, the 2008-10-03 dump existed for a short while only.
> It evaporated before wikistats and many others could parse it,
> so now we can finally catch up on 3.5 (!) years of backlog.
>
> Erik Zachte
>
>> -----Original Message-----
>> From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:wikitech-l-
>> bounces(a)lists.wikimedia.org] On Behalf Of Tomasz Finc
>> Sent: Thursday, March 11, 2010 4:11
>> To: Wikimedia developers; xmldatadumps-admin-l(a)lists.wikimedia.org;
>> xmldatadumps(a)lists.wikimedia.org
>> Subject: [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages-
>> meta-history.xml.bz2 :D
>>
>> New full history en wiki snapshot is hot off the presses!
>>
>> It's currently being checksummed which will take a while for 280GB+ of
>> compressed data but for those brave souls willing to test please grab
>> it
>> from
>>
>> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-
>> meta-history.xml.bz2
>>
>> and give us feedback about its quality. This run took just over a month
>> and gained a huge speed-up after Tim's work on re-compressing ES. If we
>> see no hiccups with this data snapshot, I'll start mirroring it to
>> other
>> locations (internet archive, amazon public data sets, etc).
>>
>> For those not familiar, the last successful run that we've seen of this
>> data goes all the way back to 2008-10-03. That's over 1.5 years of
>> people waiting to get access to these data bits.
>>
>> I'm excited to say that we seem to have it :)
>>
>> --tomasz
New full history en wiki snapshot is hot off the presses!
It's currently being checksummed which will take a while for 280GB+ of
compressed data but for those brave souls willing to test please grab it
from
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-hi…
and give us feedback about its quality. This run took just over a month
and gained a huge speed-up after Tim's work on re-compressing ES. If we
see no hiccups with this data snapshot, I'll start mirroring it to other
locations (internet archive, amazon public data sets, etc).
For those not familiar, the last successful run that we've seen of this
data goes all the way back to 2008-10-03. That's over 1.5 years of
people waiting to get access to these data bits.
I'm excited to say that we seem to have it :)
--tomasz