Tomasz Finc wrote:
New full history en wiki snapshot is hot off the presses!
It's currently being checksummed, which will take a while for 280GB+ of compressed data, but for those brave souls willing to test, please grab it from
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
and give us feedback about its quality. This run took just over a month and gained a huge speedup after Tim's work on re-compressing ES. If we see no hiccups with this data snapshot, I'll start mirroring it to other locations (Internet Archive, Amazon Public Data Sets, etc.).
For those not familiar, the last successful run of this data that we've seen goes all the way back to 2008-10-03. That's close to 1.5 years of people waiting to get access to these data bits.
I'm excited to say that we seem to have it :)
--tomasz
We now have an md5sum for enwiki-20100130-pages-meta-history.xml.bz2.
"65677bc275442c7579857cc26b355ded"
Please verify against it before filing issues.
--tomasz
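For anyone checking a download against the md5sum above, here is a minimal Python sketch. It reads the file in fixed-size chunks so the 280GB archive never has to fit in memory. The function names and the chunk size are illustrative choices, not part of any dump tooling.

```python
import hashlib

# md5sum published in this thread for enwiki-20100130-pages-meta-history.xml.bz2
EXPECTED_MD5 = "65677bc275442c7579857cc26b355ded"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 with an expected value, ignoring quotes and case."""
    return md5_of_file(path) == expected.strip().strip('"').lower()
```

Calling `matches_published_md5("enwiki-20100130-pages-meta-history.xml.bz2")` on a finished download should return True if the file arrived intact.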
--- On Thu, 11/3/10, Tomasz Finc tfinc@wikimedia.org wrote:
From: Tomasz Finc tfinc@wikimedia.org
Subject: Re: [Xmldatadumps-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: "Wikimedia developers" wikitech-l@lists.wikimedia.org, xmldatadumps-admin-l@lists.wikimedia.org, Xmldatadumps-l@lists.wikimedia.org
Date: Thursday, 11 March 2010 09:42
One question, Tomasz: did you use pbzip2 to compress the file?
If so, then we can decompress the 280GB file with pbzip2 more efficiently (since it compresses the data in individual chunks that can be sent to different cores/CPUs). Otherwise, plain bzip2 is preferred.
Thanks in advance.
Regards, Felipe.
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
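On the pbzip2 question: as far as I understand it, pbzip2 output is a series of independent bzip2 streams written back to back, which is what allows the chunks to be handled on separate cores. The sketch below is single-threaded and only shows that Python's standard `bz2` module (3.3 or later) reads both a plain single-stream file and such a multi-stream file through the same call. The function name is made up for illustration.

```python
import bz2


def count_decompressed_bytes(source, chunk_size=1024 * 1024):
    """Stream-decompress a bzip2 file (or file object) and return its uncompressed size.

    bz2.open transparently continues across concatenated streams,
    so single-stream and multi-stream inputs are handled alike.
    """
    total = 0
    with bz2.open(source, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            total += len(chunk)
    return total
```

This does nothing to speed up decompression. It only avoids holding the uncompressed XML in memory.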
Felipe Ortega wrote:
One question, Tomasz: did you use pbzip2 to compress the file?
It's just plain bzip2. We push it through 7zip afterwards, as we've seen it reduce the size of our full dump by a huge factor.
--tomasz
--- On Thu, 11/3/10, Tomasz Finc tfinc@wikimedia.org wrote:
From: Tomasz Finc tfinc@wikimedia.org
Subject: Re: [Xmldatadumps-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: "Felipe Ortega" glimmer_phoenix@yahoo.es
CC: "Wikimedia developers" wikitech-l@lists.wikimedia.org, xmldatadumps-admin-l@lists.wikimedia.org, Xmldatadumps-l@lists.wikimedia.org
Date: Thursday, 11 March 2010 19:23
Indeed, though you should expect 7zip to spend significantly more time compressing the monster... ;-) I'm getting the bz2, just in case.
Thanks again, F.
I love lzma compression.
enwiki-20100130-pages-meta-history.xml.bz2 280.3 GB
enwiki-20100130-pages-meta-history.xml.7z 31.9 GB
Download at http://tinyurl.com/yeelbse
Enjoy!
--tomasz
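To put the two sizes above side by side, a small arithmetic sketch using only the figures quoted in this message:

```python
# Sizes as reported in the thread, in GB
BZ2_SIZE_GB = 280.3
SEVENZ_SIZE_GB = 31.9

# How many times smaller the 7z archive is than the bz2 archive
ratio = BZ2_SIZE_GB / SEVENZ_SIZE_GB
# The 7z size expressed as a fraction of the bz2 size
share = SEVENZ_SIZE_GB / BZ2_SIZE_GB

print(f"7z is {ratio:.1f}x smaller, {share:.1%} of the bz2 size")
# prints: 7z is 8.8x smaller, 11.4% of the bz2 size
```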
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Thanks Tomasz!
This is great news! Already in analysis ;)
On Mon, Mar 29, 2010 at 5:46 PM, Tomasz Finc tfinc@wikimedia.org wrote:
I love lzma compression.
Xmldatadumps-admin-l mailing list Xmldatadumps-admin-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
Got an md5sum?
You can find all the md5sums at
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
--tomasz
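Once that md5sums file has been downloaded, looking up the entry for one archive could look like the sketch below. It assumes the file uses the usual `md5sum` layout of one `<hash>  <filename>` pair per line. I have not checked the actual file, so treat the layout as an assumption. The function name is illustrative.

```python
def parse_md5sums(text):
    """Parse md5sum-style text into a {filename: hash} dictionary.

    Assumes each useful line is '<hex digest><whitespace><filename>';
    lines that do not split into two fields are skipped.
    """
    table = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        digest, name = parts
        # md5sum marks binary-mode entries with a leading '*' on the name
        table[name.strip().lstrip("*")] = digest.lower()
    return table


# Example line built from the hash and filename given earlier in this thread
sample = "65677bc275442c7579857cc26b355ded  enwiki-20100130-pages-meta-history.xml.bz2\n"
sums = parse_md5sums(sample)
```

`sums["enwiki-20100130-pages-meta-history.xml.bz2"]` then holds the hash to compare against a locally computed one.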
Anthony wrote:
Got an md5sum?
I'd like to add that the md5 of the *uncompressed* file is cd4eee6d3d745ce716db2931c160ee35. That's what I got from both the uncompressed 7z and the uncompressed bz2. They matched, whew. Uncompressing and md5ing the bz2 took well over a week. Uncompressing and md5ing the 7z took less than a day.
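A check like the one described here can be done without writing the uncompressed XML to disk, by hashing the data as it comes out of the decompressor. The sketch below covers only the bz2 side, since Python's standard library has no reader for the .7z container as far as I know. It is not necessarily how the figure above was produced, and the function name is illustrative.

```python
import bz2
import hashlib


def md5_of_uncompressed(source, chunk_size=1024 * 1024):
    """Return the hex MD5 of the data inside a bzip2 file, streaming it chunk by chunk."""
    digest = hashlib.md5()
    with bz2.open(source, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the dump discussed in this thread, the result would be compared against cd4eee6d3d745ce716db2931c160ee35.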
On Mon, Mar 29, 2010 at 8:16 PM, Tomasz Finc tfinc@wikimedia.org wrote:
You can find all the md5sums at
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt