Thankfully due to an awesome volunteer we'll be able to get that 2008 snapshot in our archive. I'll mail out when it shows up in our snail mail.
--tomasz
Erik Zachte wrote:
I'm thrilled. Big thanks to Tim and Tomasz for pulling this off. For the record, the 2008-10-03 dump existed for a short while only. It evaporated before wikistats and many others could parse it, so now we can finally catch up on a 3.5 (!) year backlog.
Erik Zachte
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Tomasz Finc
Sent: Thursday, March 11, 2010 4:11
To: Wikimedia developers; xmldatadumps-admin-l@lists.wikimedia.org; xmldatadumps@lists.wikimedia.org
Subject: [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
New full history en wiki snapshot is hot off the presses!
It's currently being checksummed, which will take a while for 280GB+ of compressed data, but for those brave souls willing to test, please grab it from
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
and give us feedback about its quality. This run took just over a month and gained a huge speedup after Tim's work on re-compressing ES. If we see no hiccups with this data snapshot, I'll start mirroring it to other locations (Internet Archive, Amazon Public Data Sets, etc.).
For those not familiar, the last successful run that we've seen of this data goes all the way back to 2008-10-03. That's over 1.5 years of people waiting to get access to these data bits.
I'm excited to say that we seem to have it :)
--tomasz
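For anyone wanting to sanity-check the file before sending feedback, here is a minimal sketch, not the script used to produce or verify the dumps. It streams the bz2 file and counts pages and revisions without unpacking it to disk. It assumes the usual MediaWiki export layout of <page> elements containing <revision> elements; the function name is made up for illustration.

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages_and_revisions(path):
    """Stream a bz2-compressed MediaWiki XML dump and count pages and revisions.

    Hypothetical helper for illustration; assumes <page> elements that
    contain <revision> elements, whatever the export schema namespace is.
    """
    pages = 0
    revisions = 0
    with bz2.open(path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _event, root = next(context)  # the first event is the start of the root element
        for event, elem in context:
            if event != "end":
                continue
            # Drop the XML namespace prefix so the check is schema-version agnostic.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "revision":
                revisions += 1
                elem.clear()  # release the revision text as soon as it is counted
            elif tag == "page":
                pages += 1
                root.clear()  # detach finished pages so memory use stays flat
    return pages, revisions
```

Comparing the two counts against figures from the live site would be one rough way to spot a dump that ended early.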
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Xmldatadumps-admin-l mailing list Xmldatadumps-admin-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
Thanks Tim and Tomasz. This is great news.
bilal -- Verily, with hardship comes ease.
On Wed, Mar 10, 2010 at 10:20 PM, Tomasz Finc tfinc@wikimedia.org wrote:
Thankfully due to an awesome volunteer we'll be able to get that 2008 snapshot in our archive. I'll mail out when it shows up in our snail mail.
--tomasz
Many thanks to everyone involved.
Also, in case it's of use to anyone, I have a copy of the enwiki-20080103-pages-meta-history.xml dump in 7z form. Is that the backup that's being referred to or is it in fact 20081003?
kpw
Yup, that's the one. If you have a fast upload pipe then I'm more than happy to set up space for it. Otherwise it should be arriving in our snail mail after a couple of days.
-tomasz
Kevin Webb wrote:
Many thanks to everyone involved.
Also, in case it's of use to anyone, I have a copy of the enwiki-20080103-pages-meta-history.xml dump in 7z form. Is that the backup that's being referred to or is it in fact 20081003?
kpw
It's in EC2 so I could get it to you in about 20 mins. Just hit me with an email off-list with the desired destination...
kpw
On Wed, Mar 10, 2010 at 10:54 PM, Tomasz Finc tfinc@wikimedia.org wrote:
Yup, that's the one. If you have a fast upload pipe then I'm more than happy to set up space for it. Otherwise it should be arriving in our snail mail after a couple of days.
-tomasz
Also, does the 20080103 dump combined with the latest 20100130 dump provide a complete edit history of Wikipedia? I'm unclear about whether the 20080103 dump was cumulative or if there was some other previous cutoff point.
Is it correct to assume that future dumps will begin post 2010-01-30?
Thanks! kpw
On Wed, Mar 10, 2010 at 10:55 PM, Kevin Webb kpwebb@gmail.com wrote:
It's in EC2 so I could get it to you in about 20 mins. Just hit me with an email off-list with the desired destination...
kpw
Newer snapshots supersede their older brethren, so 20100130 already includes all of the old content of 20080103, barring any format changes.
--tomasz
I thought someone had concluded that the 2008 dump was broken. In other words that the dumper exited, but the dump only contained about 10% of what it ought to have contained. I've never handled it myself, so I don't know for sure.
Regardless, congratulations Tomasz and others. Getting the dump system fully working is a great milestone.
-Robert Rohde
Anyone may download the file from me here:
http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
The md5sum is:
20a201afc05a4e5f2f6c3b9b7afa225c enwiki-20080103-pages-meta-history.xml.7z
The file size is:
18522193111 (~18 gigabytes)
I'm sure you will find my pipe fat enough..;-)
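For anyone grabbing the file, a rough sketch of how the download could be checked against the size and md5sum quoted above. The file is far too big to read into memory in one go, so the hash is fed in chunks. The helper names and the chunk size are assumptions for illustration; the two expected values are the ones given in this message.

```python
import hashlib
import os

# Values quoted in the message above.
EXPECTED_MD5 = "20a201afc05a4e5f2f6c3b9b7afa225c"
EXPECTED_SIZE = 18522193111


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Check the cheap thing (size) first, then the expensive thing (checksum)."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5
```

Checking the size first means a truncated download is rejected without hashing 18 GB.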
Brian J Mingus wrote:
The file size is:
18522193111 (~18 gigabytes)
That seems way too tiny to be the real thing.
--tomasz
On Wed, Mar 10, 2010 at 10:43 PM, Tomasz Finc tfinc@wikimedia.org wrote:
That seems way too tiny to be the real thing.
--tomasz
7zip has a very impressive compression ratio. From download.wikimedia.org:
- These dumps can be *very* large, uncompressing up to 100 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
That notice has not changed since I downloaded this file. The uncompressed size could be well over a terabyte. I'm not sure how long it will take to unpack, but I have just started it. I wonder what drives your intuition?
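To make the arithmetic behind that estimate explicit, here is a back-of-the-envelope sketch. It treats the "up to 100 times" figure from the notice as a worst-case upper bound, not a measured ratio for this file, and uses decimal terabytes; the function name is made up for illustration.

```python
ARCHIVE_BYTES = 18522193111  # size of the .7z file quoted earlier in the thread
MAX_EXPANSION = 100          # "up to 100 times" from the download.wikimedia.org notice


def worst_case_uncompressed_bytes(archive_bytes, expansion=MAX_EXPANSION):
    """Upper bound on the unpacked size, assuming the notice's ratio applies."""
    return archive_bytes * expansion


estimate = worst_case_uncompressed_bytes(ARCHIVE_BYTES)
print(f"{estimate / 10**12:.2f} TB")  # prints "1.85 TB"
```

The real figure will only be known once the unpack finishes.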
I also have a copy of it. The md5sum and file size match those of the file that was published on downloads.wikimedia.org.
I have the .sql.gz too, if you want them.
xmldatadumps-l@lists.wikimedia.org