New full history en wiki snapshot is hot off the presses!
It's currently being checksummed, which will take a while for 280GB+ of compressed data, but for those brave souls willing to test, please grab it from
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-his...
and give us feedback about its quality. This run took just over a month and gained a huge speed-up after Tim's work on re-compressing ES. If we see no hiccups with this data snapshot, I'll start mirroring it to other locations (Internet Archive, Amazon Public Data Sets, etc.).
For those not familiar, the last successful run that we've seen of this data goes all the way back to 2008-10-03. That's over 1.5 years of people waiting to get access to these data bits.
I'm excited to say that we seem to have it :)
--tomasz
I'm thrilled. Big thanks to Tim and Tomasz for pulling this off. For the record, the 2008-10-03 dump existed only for a short while. It evaporated before wikistats and many others could parse it, so now we can finally catch up on a 3.5 (!) year backlog.
Erik Zachte
Thankfully, due to an awesome volunteer, we'll be able to get that 2008 snapshot into our archive. I'll mail out when it shows up in our snail mail.
--tomasz
We now have an md5sum for enwiki-20100130-pages-meta-history.xml.bz2.
"65677bc275442c7579857cc26b355ded"
Please verify against it before filing issues.
--tomasz
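For anyone testing the download, a minimal check against the published sum looks like this (assuming GNU coreutils' md5sum; note that hashing 280GB+ will itself take a while):

    # hash the download and eyeball it against the published value
    md5sum enwiki-20100130-pages-meta-history.xml.bz2

    # or let md5sum do the comparison itself: it expects "<hash>  <filename>"
    # (two spaces) and prints OK or FAILED for the file
    echo "65677bc275442c7579857cc26b355ded  enwiki-20100130-pages-meta-history.xml.bz2" | md5sum -c -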
One question, Tomasz: did you use pbzip2 to compress the file?
If so, then we can decompress the 280GB file with pbzip2 more efficiently (since it compresses the data in individual chunks that can be sent to different cores/CPUs). Otherwise, plain bzip2 is preferred.
Thanks in advance.
Regards, Felipe.
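For reference, a rough sketch of the two cases Felipe is describing, assuming pbzip2 is installed; pbzip2 only gets real parallelism on archives that pbzip2 itself wrote as a series of independent bzip2 streams:

    # stock bzip2: single-threaded (-d decompress, -k keep the .bz2 around)
    bzip2 -dk enwiki-20100130-pages-meta-history.xml.bz2

    # pbzip2: decompress using 8 processors -- worthwhile only if the file
    # was compressed with pbzip2 in the first place
    pbzip2 -dk -p8 enwiki-20100130-pages-meta-history.xml.bz2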
Felipe Ortega wrote:
One question, Tomasz: did you use pbzip2 to compress the file?
It's just plain bzip2. We push it through 7zip afterwards, as we've seen it reduce our full-history dump by a huge factor.
--tomasz
Indeed, though you should expect 7zip to spend significantly more time compressing the monster... ;-). I'm getting the bz2, just in case.
Thanks again, F.
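For the curious, the bz2-to-7z step Tomasz describes can be approximated with a pipeline along these lines (a sketch only; the exact options the dump scripts use aren't given here, and -mx=9 is just an assumption for maximum LZMA compression):

    # stream the bzip2 dump straight into a 7z archive, with no 5+ TB intermediate file;
    # -si reads the data from stdin and stores it under the given name
    bzcat enwiki-20100130-pages-meta-history.xml.bz2 \
      | 7z a -sienwiki-20100130-pages-meta-history.xml -mx=9 \
        enwiki-20100130-pages-meta-history.xml.7z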
I love lzma compression.
enwiki-20100130-pages-meta-history.xml.bz2 280.3 GB
enwiki-20100130-pages-meta-history.xml.7z 31.9 GB
Download at http://tinyurl.com/yeelbse
Enjoy!
--tomasz
Got an md5sum?
You can find all the md5sums at
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
--tomasz
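For anyone checking more than one file, that list should work directly with md5sum's check mode, assuming it uses the usual "<hash>  <filename>" layout (md5sum will grumble about listed files you haven't downloaded, which is harmless):

    # fetch the checksum list and verify whatever dump files sit in the current directory
    wget http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
    md5sum -c enwiki-20100130-md5sums.txt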
I'd like to add that the md5 of the *uncompressed* file is cd4eee6d3d745ce716db2931c160ee35 . That's what I got from both the uncompressed 7z and the uncompressed bz2. They matched, whew. Uncompressing and md5ing the bz2 took well over a week. Uncompressing and md5ing the 7z took less than a day.
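For anyone repeating that check, streaming the decompressed output straight into md5sum avoids writing the ~5.3 TB of XML to disk at all (a sketch, assuming 7z and bzip2 are on the path; it should reproduce the figure above):

    # md5 of the uncompressed XML, streamed out of the 7z; -so writes extracted data to stdout
    7z x -so enwiki-20100130-pages-meta-history.xml.7z | md5sum

    # same thing from the bz2 -- same hash, just far more slowly, as noted above
    bzcat enwiki-20100130-pages-meta-history.xml.bz2 | md5sum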
Dumping and parsing large XML files came up at work today, which made me think of this: how big, exactly, is the uncompressed file?
-Q
5.34 terabytes was the figure I got.
"7z l enwiki-20100130-pages-meta-history.xml.7z" gives an uncompressed size of 5873134833455. I assume that's bytes, and googling "5873134833455 bytes to terabytes" gives me "5.34158501 terabytes".
--- On Thu, 3/11/10, Tomasz Finc tfinc@wikimedia.org wrote:
New full history en wiki snapshot is hot off the presses!
It's currently being checksummed, which will take a while for 280GB+ of compressed data, but for those brave souls willing to test, please grab it from
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-his...
and give us feedback about its quality. This run took just over a month and gained a huge speed-up after Tim's work on re-compressing ES. If we see no hiccups with this data snapshot, I'll start mirroring it to other locations (Internet Archive, Amazon Public Data Sets, etc.).
Really good news :-)
For those not familiar, the last successful run that we've seen of this data goes all the way back to 2008-10-03. That's over 1.5 years of people waiting to get access to these data bits.
In fact, something went wrong with that one, as well. The last valid full dump (afaik) was 2008-03-03, containing data up to early January 2008.
I'm excited to say that we seem to have it :)
Let's cross our fingers. Congrats on the great job, guys!
Felipe
Tomasz Finc wrote:
New full history en wiki snapshot is hot off the presses!
So now that we've had it for a couple of days ... can I get a status report from someone about its quality?
Even if you had no issues, please let us know so that we can start mirroring.
--tomasz
Oops, this was meant to go to the mailing lists (why are we all posting to multiple lists? shouldn't we just pick one?), so forwarding!
---------- Forwarded message ----------
From: Thomas Dalton thomas.dalton@gmail.com
Date: 2010/3/16
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Tomasz Finc tfinc@wikimedia.org
This is really great! Thanks to everyone involved. Will the statistics tables be updated soon? I want to get this updated with an accurate words-per-article figure:
http://en.wikipedia.org/wiki/Wikipedia:Size_in_volumes
I expect the figure has changed a lot over the past 3.5 years, but I have no idea in what direction (hopefully upwards!), and that could significantly change the number of volumes. (Of course, it is a completely useless image, but it's fun!)