We do want to bail on attempts to retrieve a revision after a few tries since some revisions are irrecoverable.
What separates a recoverable revision from an irrecoverable one? Is it just random, or are some revisions always irrecoverable? Also, do you guys have a description or diagram of your database dump system? I think it would be good to share this info; maybe there is a way to make the database dump ALWAYS work! :)
cheers, Jamie
----- Original Message -----
From: "Ariel T. Glenn" ariel@wikimedia.org
Date: Wednesday, July 21, 2010 3:45 pm
Subject: Re: [Xmldatadumps-l] enwiki dump progress on 20100622 - failed again
To: Dmitry Chichkov dchichkov@gmail.com
Cc: Jamie Morken jmorken@shaw.ca, xmldatadumps-l@lists.wikimedia.org
These don't cause failure of the backups; a separate (much larger) number of failed retrieved revisions causes that.
We do want to bail on attempts to retrieve a revision after a few tries since some revisions are irrecoverable.
Ariel
On Wed, 21-07-2010, at 15:39 -0700, Dmitry Chichkov wrote:
20100719 4:37:21am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
Error 5 of allowed 5 retrieving revision text for text id 358280940! Pausing 5 seconds before retry...
Well, my comment here would be that the number of 'allowed errors = 5' and the 'retry delay 5 seconds' seem to be rather small. From that it looks like a 25-second database outage would cause backup failure. Considering that the backup literally takes a month...
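The arithmetic behind the 25-second figure can be sketched as follows. This is only an illustration under the assumption that all five allowed errors are consecutive failures separated by the 5-second pause; the polled output later in this thread shows the counter moving 2, 3, 4, 3, 4, 5 across different text ids, so the real accounting may well differ. The function name and the 30-day run length are made up for the example.

```python
# Values read off the quoted log message:
# "Error 5 of allowed 5 ... Pausing 5 seconds before retry..."
ALLOWED_ERRORS = 5
RETRY_DELAY_SECONDS = 5

# Assumption: "literally takes a month" rounded to 30 days.
DUMP_DURATION_DAYS = 30


def tolerated_outage_seconds(allowed_errors: int, retry_delay: int) -> int:
    """Longest outage survived if every retry fails back to back (assumed model)."""
    return allowed_errors * retry_delay


outage = tolerated_outage_seconds(ALLOWED_ERRORS, RETRY_DELAY_SECONDS)
dump_seconds = DUMP_DURATION_DAYS * 24 * 60 * 60

print(f"tolerated outage: {outage} s")
print(f"share of a {DUMP_DURATION_DAYS}-day run: {outage / dump_seconds:.6%}")
```

Under these assumptions the tolerated window is a tiny fraction of the total run time, which is the point being made above.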
I'd suggest setting the allowed error rate to something like 0.01% of the number of revisions. Also, an incomplete dump (e.g. with missing revision texts) is much, much better than nothing, so it would only make sense to allow higher error rates or even make the interruption procedure manual.
To put that 0.01% error rate into perspective, according to my estimates the error rate in the last "complete" database dump [enwiki-20100130 31.9GB/280GB] was at least ~0.4% (missing revision texts due to backup process failures).
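A proportional error budget of this kind could look roughly like the sketch below. It is not how the dump scripts work; the class and field names are hypothetical, and the revision count is the "[max 371385750]" figure from the progress lines quoted further down in this thread. Rates are expressed in parts per million (0.01% is 100 ppm) so the limit is computed with integer arithmetic.

```python
from dataclasses import dataclass

PPM = 1_000_000


@dataclass
class ErrorBudget:
    """Hypothetical failure budget as a fraction of all revisions."""

    total_revisions: int
    rate_ppm: int  # 0.01% == 100 parts per million
    failures: int = 0

    @property
    def allowed(self) -> int:
        # Integer math avoids float rounding on large revision counts.
        return self.total_revisions * self.rate_ppm // PPM

    def record_failure(self) -> bool:
        """Count one unrecoverable revision; True while still within budget."""
        self.failures += 1
        return self.failures <= self.allowed


# "[max 371385750]" is taken from the polled progress output in this thread.
budget = ErrorBudget(total_revisions=371_385_750, rate_ppm=100)
print(f"failures tolerated at 0.01%: {budget.allowed}")
```

With a budget like this, a run would skip a revision it cannot retrieve and only abort once record_failure() returns False, instead of stopping at a fixed count of 5.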
-- Regards, Dmitry
On Wed, Jul 21, 2010 at 4:03 AM, Jamie Morken jmorken@shaw.ca wrote:
Hi, I was polling the http://download.wikimedia.org/enwiki/20100622/ page during the pages-meta-history.xml.bz2 database dump, and here is some timestamped output from that page showing some errors that caused the dump to fail. Regarding the .bz2 dump format, Tomasz earlier suggested removing it and using .7z. I thought it might be good to keep the .bz2 format since several programs use it (e.g. wikitaxi, bzreader). The 7z format is probably the way to go for the future, but I don't know if this would fix the database dump errors.
cheers, Jamie
---------------
20100719 2:22:14am
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
2010-07-19 09:22:11: enwiki 889057 pages (0.613/sec), 110108000 revs (75.931/sec), 83.6% prefetched, ETA 2010-08-28 05:12:01 [max 371385750]
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.7 GB (written)
---------------
20100719 3:07:16am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
2010-07-19 10:07:15: enwiki 894194 pages (0.615/sec), 110399000 revs (75.990/sec), 83.6% prefetched, ETA 2010-08-28 04:08:46 [max 371385750]
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.9 GB (written)
---------------
20100719 3:22:17am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
Error 2 of allowed 5 retrieving revision text for text id 10595737! Pausing 5 seconds before retry...
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.9 GB (written)
---------------
20100719 3:37:18am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
Error 3 of allowed 5 retrieving revision text for text id 13930238! Pausing 5 seconds before retry...
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.9 GB (written)
---------------
20100719 3:52:19am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
Error 4 of allowed 5 retrieving revision text for text id 355313550! Pausing 5 seconds before retry...
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.9 GB (written)
---------------
20100719 4:07:20am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
Error 3 of allowed 5 retrieving revision text for text id 346806445! Pausing 5 seconds before retry...
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.9 GB (written)
---------------
20100719 4:22:21am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
Error 4 of allowed 5 retrieving revision text for text id 351921561! Pausing 5 seconds before retry...
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.9 GB (written)
---------------
20100719 4:37:21am PST
# 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
Error 5 of allowed 5 retrieving revision text for text id 358280940! Pausing 5 seconds before retry...
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2 119.9 GB (written)
---------------
20100719 4:52:24am PST
# 2010-07-19 11:37:24 failed All pages with complete page edit history (.bz2)
#6 {main}
* These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
* pages-meta-history.xml.bz2
---------------
Pages referenced in the above errors:
---------------
http://en.wikipedia.org/w/index.php?oldid=10595737
Brothers in Arms: Road to Hill 30
"This is an old revision of this page, as edited by Colonel Cow (talk | contribs) at 01:02, 17 February 2005."
---------------
http://en.wikipedia.org/w/index.php?oldid=13930238
Brothers in Arms: Road to Hill 30
"This is an old revision of this page, as edited by 213.212.58.66 (talk) at 12:34, 19 May 2005."
---------------
http://en.wikipedia.org/w/index.php?oldid=355313550
User:Peter I. Vardy/sandbox
"This is an old revision of this page, as edited by Peter I. Vardy (talk | contribs) at 10:53, 11 April 2010."
---------------
http://en.wikipedia.org/w/index.php?oldid=346806445
Talk:Amy Shearn
"This is an old revision of this page, as edited by Yobot (talk | contribs) at 02:49, 28 February 2010."
---------------
http://en.wikipedia.org/w/index.php?oldid=351921561
User:Ohms Law Bot/Cleanup/Roy D. Bridges, Jr.
"This is an old revision of this page, as edited by Ohms Law Bot (talk | contribs) at 06:26, 25 March 2010."
---------------
http://en.wikipedia.org/w/index.php?oldid=358280940
The Tower Treasure
"This is an old revision of this page, as edited by 69.144.24.63 (talk) at 21:36, 25 April 2010."
---------------
----- Original Message -----
From: Dmitry Chichkov dchichkov@gmail.com
Date: Tuesday, July 20, 2010 3:31 pm
Subject: [Xmldatadumps-l] enwiki dump progress on 20100622 - failed again
To: xmldatadumps-l@lists.wikimedia.org
> Subj: http://download.wikimedia.org/enwiki/20100622/%3E%C2%A0%C2%A0%C2%A0%C2%A0%C2...
>
> Is there anything that can be done to alleviate that problem?
>
> By the way, what's the point of producing .bz2 version of the
> pages-meta-history.xml dump? Is it easier on the system to produce .bz2
> first and .7z after that? From the user's perspective I can tell that .7z is
> all I need, there is simply no point in working with .bz2 (if .7z is
> available).
>
> -- Regards, Dmitry
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l