For folks who have not been following the saga on http://wikitech.wikimedia.org/view/Dataset1: we were able to get the RAID array back in service last night on the XML data dumps server, and we are now busily copying data off of it to another host. There are about 11 TB of dumps to copy over; once that's done we will start serving these dumps read-only to the public again. Because the state of the server hardware is still uncertain, we don't want to do anything that might put the data at risk until that copy has been made.
The replacement server is on order and we are watching that closely.
We have also been working on deploying a server to run one round of dumps in the interim.
Thanks for your patience (which is a way of saying, I know you are all out of patience, as am I, but hang on just a little longer).
Ariel
Great news! Thanks for the update, and thanks for all your work getting it beaten back into shape. Keeping fingers crossed for all going well on the transfer...
-- brion On Dec 14, 2010 1:12 AM, "Ariel T. Glenn" ariel@wikimedia.org wrote:
+1 Diederik
On 2010-12-14, at 12:02, Brion Vibber brion@pobox.com wrote:
Thanks.
Double good news: http://lists.wikimedia.org/pipermail/foundation-l/2010-December/063088.html
2010/12/14 Ariel T. Glenn ariel@wikimedia.org
We now have a copy of the dumps on a backup host. Although we are still resolving hardware issues on the XML dumps server, we think it is safe enough to serve the existing dumps read-only. DNS was updated to that effect already; people should see the dumps within the hour.
Ariel
Good news, but from a professional point of view, having the dumps on just one array will keep leading to outages like this. Any plans for a tape backup or a mirror?
masti
On 12/15/2010 08:57 PM, Ariel T. Glenn wrote:
The files have now been copied off the server onto a backup host, which is the only reason we feel safe serving them again.
We will be getting a new host (it is due to be shipped soon), which will host the live data; the current server will then hold a backup copy. That is the short-term answer to your question. In the longer term we expect to have a redundant copy elsewhere and to stop relying on dataset1 entirely.
We are interested in other mirrors of the dumps; see
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
Ariel
On Wed, 15-12-2010, at 21:16 +0100, masti wrote:
On Wed, Dec 15, 2010 at 3:30 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
We are interested in other mirrors of the dumps; see
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
On the talk page, it says "torrents are useful to save bandwidth, which is not our problem". If bandwidth is not the problem, then what *is* the problem?
If the problem is just to get someone to store the data on hard drives, then it's a much easier problem than actually *hosting* that data.
On Wed, 15-12-2010, at 15:57 -0500, Anthony wrote:
We certainly want people to host it as well. It's not a matter of bandwidth but of protection: if someone can't get to our copy for whatever reason, another copy is accessible.
Ariel
On Wed, Dec 15, 2010 at 10:03 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
We certainly want people to host it as well. It's not a matter of bandwidth but of protection: if someone can't get to our copy for whatever reason, another copy is accessible.
Is there a copy in Amsterdam? It seems like that would be the most obvious place to put a backup, as WMF already has a lot of servers there.
On Wed, 15-12-2010, at 22:50 +0100, Bryan Tong Minh wrote:
We want people besides us to host it. We expect to put a copy at the new data center (at least), as well.
Ariel
On Wed, Dec 15, 2010 at 4:56 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
We want people besides us to host it. We expect to put a copy at the new data center (at least), as well.
Does anyone know if the Wikipedia XML Data AWS Public Dataset [1] is being routinely updated? It's showing a last update of "September 29, 2009 1:09 AM GMT", but perhaps that's just the last update to the dataset metadata? I guess I could mount the EBS volume to check myself... It might be nice if the database dumps were included as well.
//Ed
On 12/15/2010 09:30 PM, Ariel T. Glenn wrote:
We are interested in other mirrors of the dumps; see
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
Just as a small-scale experiment, I tried to mirror the Faroese (fowiki) and Sami (sewiki) language projects. But "wget -m" says that timestamps are turned off, so it keeps downloading the same files again. Is this an error on my side or on the server side?
This happens for some files, but not for all. Here is one example:
--2010-12-15 23:59:54-- http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pa...
Reusing existing connection to download.wikimedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 95974 (94K) [application/octet-stream]
Last-modified header missing -- time-stamps turned off.
--2010-12-15 23:59:54-- http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pa...
Reusing existing connection to download.wikimedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 95974 (94K) [application/octet-stream]
Saving to: `download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2'
100%[======================================>] 95,974 156K/s in 0.6s
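In case it helps while the missing Last-modified headers are sorted out, here is a rough workaround sketch: skip wget's timestamping entirely, avoid re-fetching files you already have, and verify the local copies against published checksums instead. It assumes GNU wget and md5sum, and it assumes this dump directory publishes an md5sums file the way the larger wikis do (adjust or drop that last step if it does not):

    # -nc (no-clobber) skips files that already exist locally, so nothing is re-downloaded
    wget -r -np -nc http://download.wikimedia.org/fowikisource/20100307/
    cd download.wikimedia.org/fowikisource/20100307/
    # then verify the local copies; anything reported as FAILED should be fetched again
    md5sum -c fowikisource-20100307-md5sums.txt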
Good work.
2010/12/15 Ariel T. Glenn ariel@wikimedia.org
Yeah, great work Ariel. Thanks a lot for the effort.
Best, F.
--- On Wed, 15/12/10, Ariel T. Glenn ariel@wikimedia.org wrote:
From: Ariel T. Glenn ariel@wikimedia.org
Subject: Re: [Xmldatadumps-l] dataset1, xml dumps
To: wikitech-l@lists.wikimedia.org
CC: xmldatadumps-l@lists.wikimedia.org
Date: Wednesday, 15 December 2010, 20:57
We now have a copy of the dumps on a backup host. Although we are still resolving hardware issues on the XML dumps server, we think it is safe enough to serve the existing dumps read-only. DNS was updated to that effect already; people should see the dumps within the hour.
Ariel
Ariel T. Glenn <ariel@wikimedia.org> writes:
We now have a copy of the dumps on a backup host. Although we are still resolving hardware issues on the XML dumps server, we think it is safe enough to serve the existing dumps read-only. DNS was updated to that effect already; people should see the dumps within the hour.
Ariel
Hi, thank you for working so hard on this issue. I'm still having trouble with the latest en.wikipedia dump, however. I downloaded http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2 and am running into trouble decompressing it.
In particular, bzip2 -d enwiki-20101011-pages-articles.xml.bz2 fails.
And bzip2 -tvv enwiki-20101011-pages-articles.xml.bz2 reports:
[2752: huff+mtf data integrity (CRC) error in data
I ran bzip2recover & then bzip2 -t rec* and got the following:
bzip2: rec02752enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC) error in data
bzip2: rec08881enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC) error in data
bzip2: rec26198enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC) error in data
Have you checked the md5sum?
2010/12/16 Gabriel Weinberg yegg@alum.mit.edu
md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as opposed to 7a4805475bba1599933b3acd5150bd4d on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt ).
I've downloaded it twice now and have gotten the same md5sum. Can anyone else confirm?
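For reference, a minimal check-and-resume sketch -- assuming GNU wget and md5sum are available; wget -c resumes a partial download rather than starting over:

    wget -c http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2
    wget http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt
    # compare only the entry for this file against the local copy
    grep 'pages-articles.xml.bz2$' enwiki-20101011-md5sums.txt | md5sum -c -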
On Thu, Dec 16, 2010 at 5:41 PM, emijrp emijrp@gmail.com wrote:
If the md5s don't match, the files are obviously different; in other words, one of them is corrupt.
What is the size of your local file? I usually download dumps with the UNIX wget command and I don't get errors. If you are using FAT32, file size is limited to 4 GB and larger files are truncated. Is that your case?
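A quick way to compare the local size against what the server reports, assuming curl is available (wget --spider with --server-response can do something similar):

    ls -l enwiki-20101011-pages-articles.xml.bz2
    # ask the server for headers only and read the Content-Length
    curl -sI http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2 | grep -i '^content-length'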
2010/12/16 Gabriel Weinberg yegg@alum.mit.edu
I've been downloading this file (using wget on ubuntu or fetch on FreeBSD) with no issues for years. The current one is 6.2GB as it should be.
On Thu, Dec 16, 2010 at 5:53 PM, emijrp emijrp@gmail.com wrote:
I was able to unzip a copy of the file on another host (taken from the same location) without problems. On the download host itself I get the correct md5sum: 7a4805475bba1599933b3acd5150bd4d
Ariel
On Thu, 16-12-2010, at 17:48 -0500, Gabriel Weinberg wrote:
Thx--I guess I'll try again--third time's the charm I suppose :)
Sorry to waste your time,
Gabriel
On Thu, Dec 16, 2010 at 6:13 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
I was able to unzip a copy of the file on another host (taken from the same location) without problems. On the download host itself I get the correct md5sum: 7a4805475bba1599933b3acd5150bd4d
Ariel
Gabriel Weinberg wrote:
md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as opposed to 7a4805475bba1599933b3acd5150bd4d on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt ).
I've downloaded it twice now and have gotten the same md5sum. Can anyone else confirm?
I downloaded the right file without problems.
You can also try downloading it from http://archivos.wikimedia-es.org/mirror/wikimedia/dumps/enwiki/2010101/enwik...
Google has donated storage space for backups of the XML dumps. Accordingly, a copy of the latest complete dump for each project is being copied over (public files only). We expect to run similar copies once every two weeks, keeping the four latest copies as well as one permanent copy every six months. That can be adjusted as we see how things go.
Ariel
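Purely as an illustration of the "keep the four most recent copies" part of that schedule (the six-monthly permanent copies would have to be excluded separately), a pruning sketch, assuming GNU coreutils and that each copy lives in a date-named directory; the path is illustrative:

    cd /backups/xmldumps/enwiki    # illustrative path
    # list date-named directories oldest first, drop all but the last four (dry run: remove "echo" to actually delete)
    ls -d 20??????/ | sort | head -n -4 | xargs -r echo rm -rf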
Ariel T. Glenn wrote:
Are they readable from somewhere? Apparently, in order to read them you need to sign up on a list and wait for an invitation, which is available only to US developers.
I sent mail immediately after my initial mail to these lists, to find out whether we can make them readable to the public, whether there would be a fee, etc. As soon as I have more information, I will pass it on. At the least this gives WMF one more copy. Of course it would be best if it gave everyone one more copy.
Ariel
On Mon, 20-12-2010, at 17:41 +0100, Platonides wrote:
The new host Dataset2 is now up and running and serving XML dumps. Those of you paying attention to DNS entries should see the change within the hour. We are not generating new dumps yet but expect to do so soon.
Ariel
Hi,
That is great news. Thank you for all the hard work you have done on this, and most of all Season's Greetings, Merry Christmas, and Happy New Year! :)
best regards, Jamie
----- Original Message -----
From: "Ariel T. Glenn" ariel@wikimedia.org
Date: Friday, December 24, 2010 10:42 am
Subject: Re: [Xmldatadumps-l] [Wikitech-l] dataset1, xml dumps
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Cc: xmldatadumps-l@lists.wikimedia.org
So "soon" took longer than I would have liked. However, we are up and running with the new code. I have started a few processes going and over the next few days I will ramp it up to the usual number. In particular I want to start a separate job for the larger wikis so that the smaller jobs don't get trapped behind them.
Guess I'd better go update the various pages on wikitech now.
Ariel
On Fri, 24-12-2010, at 20:42 +0200, Ariel T. Glenn wrote:
On 10/01/11 22:13, Ariel T. Glenn wrote:
So "soon" took longer than I would have liked. However, we are up and running with the new code. I have started a few processes going and over the next few days I will ramp it up to the usual number. In particular I want to start a separate job for the larger wikis so that the smaller jobs don't get trapped behind them.
Guess I'd better go update the various pages on wikitech now.
Ariel
Thanks Ariel, that's good to hear.
Would it be possible to take this a step further, and for a single job to be started up just for enwiki?
enwiki is unique among all the dumps in that it is the only one that regularly fails more often than it succeeds; even partial dumps are better than none, and enwiki also takes longer than any other dump before it (typically) fails, so retrying it more aggressively than others -- and independently of them, so it does not hold the other wikis up -- would seem appropriate.
Thus, under this proposal, there would be three jobs running:
* enwiki
* other large wikis
* all small wikis
-- Neil
On Tue, 11-01-2011, at 10:16 +0000, Neil Harris wrote:
Ah yes, sorry that wasn't clear from the earlier message. I already pulled enwiki out of the main list and it will run as a bunch of smaller parallel jobs on its own host.
Ariel
You may be noticing a "recombine" step for several files on the recent dumps which simply seems to list the same file again. That's a bug, not a feature; fortunately it doesn't impact the files themselves. I have fixed the configuration file so that it should no longer claim to run these, as they are for the parallel-run function, which is not needed on the smaller wikis.
I'm thinking about whether or not to clean up the index.html and md5sums on these to remove the bogus lines. Doing the index files would be a bit tedious.
Ariel
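Assuming the bogus "recombine" entries are exact duplicates of lines already present in each md5sums file (an assumption based on the description above), a one-liner along these lines would drop them; the filename is illustrative:

    # keep only the first occurrence of each line, preserving order
    awk '!seen[$0]++' elwiki-20110115-md5sums.txt > elwiki-20110115-md5sums.txt.fixed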