> We do want to bail on attempts to retrieve a revision after a few tries
> since some revisions are irrecoverable.
What separates a recoverable revision from an irrecoverable one? Is it just random, or are some revisions always irrecoverable? Also, do you guys have a description or diagram of your database dump system? I think it would be good to share this info; maybe there is a way to make the database dump ALWAYS work! :)
cheers,
Jamie
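
For illustration only, here is a minimal Python sketch of the kind of policy discussed in the quoted thread below: retry each revision a few times, skip it if it still fails, and abort the whole run only once the skipped revisions exceed an error budget (Dmitry suggests something like 0.01% of the number of revisions). The names fetch_with_retries, dump_revisions, ErrorBudgetExceeded and the fetch callback are all hypothetical, and this is an assumption about how such a policy could look, not a description of how the actual dump scripts work.

```python
import time


class ErrorBudgetExceeded(RuntimeError):
    """Raised when more revisions were skipped than the budget allows."""


def fetch_with_retries(fetch, text_id, max_tries=5, delay=5.0, sleep=time.sleep):
    """Call fetch(text_id) up to max_tries times; return None if all tries fail."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(text_id)
        except Exception:
            # Pause between attempts, but not after the final one.
            if attempt < max_tries:
                sleep(delay)
    return None


def dump_revisions(fetch, text_ids, total_revisions, max_error_rate=0.0001,
                   **retry_options):
    """Fetch every revision text, skipping failures while under the budget."""
    # Hypothetical budget: 0.01% of all revisions by default.
    budget = int(total_revisions * max_error_rate)
    texts, skipped = {}, []
    for text_id in text_ids:
        text = fetch_with_retries(fetch, text_id, **retry_options)
        if text is None:
            # Record the missing revision instead of failing the whole dump.
            skipped.append(text_id)
            if len(skipped) > budget:
                raise ErrorBudgetExceeded(
                    f"{len(skipped)} revisions skipped, budget is {budget}")
        else:
            texts[text_id] = text
    return texts, skipped
```

With the "max 371385750" figure from the status lines quoted below, a 0.01% budget would allow roughly 37,000 skipped revisions before the run aborts, and the list of skipped text ids could be published alongside the incomplete dump.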
----- Original Message -----
From: "Ariel T. Glenn" <ariel@wikimedia.org>
Date: Wednesday, July 21, 2010 3:45 pm
Subject: Re: [Xmldatadumps-l] enwiki dump progress on 20100622 - failed again
To: Dmitry Chichkov <dchichkov@gmail.com>
Cc: Jamie Morken <jmorken@shaw.ca>, xmldatadumps-l@lists.wikimedia.org
> These don't cause failure of the backups; a separate (much larger)
> number of failed retrieved revisions causes that.
>
> We do want to bail on attempts to retrieve a revision after a few tries
> since some revisions are irrecoverable.
>
> Ariel
>
> On 21-07-2010, Wed, at 15:39 -0700, Dmitry Chichkov wrote:
> >
> > >> 20100719 4:37:21am PST
> > >> # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > >> Error 5 of allowed 5 retrieving revision text for text id 358280940! Pausing 5 seconds before retry...
> >
> > Well, my comment here would be that the number of allowed errors (5)
> > and the retry delay (5 seconds) seem to be rather small. From that it
> > looks like 25 seconds of database unavailability would cause backup
> > failure. Considering that the backup literally takes a month...
> >
> > I'd suggest setting the allowed error rate to something like 0.01% of
> > the number of revisions. Also, an incomplete dump (e.g. with missing
> > revision texts) is much, much better than nothing, so it would only
> > make sense to allow higher error rates or even make the interruption
> > procedure manual.
> >
> > To put that 0.01% error rate into perspective, according to my
> > estimates the error rate in the last "complete" database dump
> > [enwiki-20100130 31.9GB/280GB] was at least ~0.4% (missing revision
> > texts due to backup process failures).
> >
> > -- Regards, Dmitry
> >
> >
> >
> >
> >
> > On Wed, Jul 21, 2010 at 4:03 AM, Jamie Morken <jmorken@shaw.ca> wrote:
> >
> >
> > Hi,
> >
> > I was polling the http://download.wikimedia.org/enwiki/20100622/ page
> > during the pages-meta-history.xml.bz2 database dump, and here is some
> > timestamped output from that page showing some errors that caused the
> > dump to fail. Regarding the .bz2 dump format, Tomasz earlier suggested
> > removing it and using .7z. I thought it might be good to keep the .bz2
> > format since several programs use it (e.g. wikitaxi, bzreader). The 7z
> > format is probably the way to go for the future, but I don't know if
> > this would fix the database dump errors.
> >
> > cheers,
> > Jamie
> >
> >
> > -----------------------------------------------
> >
> > 20100719 2:22:14am
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > 2010-07-19 09:22:11: enwiki 889057 pages (0.613/sec), 110108000 revs (75.931/sec), 83.6% prefetched, ETA 2010-08-28 05:12:01 [max 371385750]
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.7 GB (written)
> >
> >
> >
> > -----------------------------------------------
> >
> > 20100719 3:07:16am PST
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > 2010-07-19 10:07:15: enwiki 894194 pages (0.615/sec), 110399000 revs (75.990/sec), 83.6% prefetched, ETA 2010-08-28 04:08:46 [max 371385750]
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.9 GB (written)
> >
> > -----------------------------------------------
> >
> > 20100719 3:22:17am PST
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > Error 2 of allowed 5 retrieving revision text for text id 10595737! Pausing 5 seconds before retry...
> >
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.9 GB (written)
> >
> > -----------------------------------------------
> >
> > 20100719 3:37:18am PST
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > Error 3 of allowed 5 retrieving revision text for text id 13930238! Pausing 5 seconds before retry...
> >
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.9 GB (written)
> >
> >
> > -----------------------------------------------
> >
> > 20100719 3:52:19am PST
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > Error 4 of allowed 5 retrieving revision text for text id 355313550! Pausing 5 seconds before retry...
> >
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.9 GB (written)
> >
> > -----------------------------------------------
> >
> > 20100719 4:07:20am PST
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > Error 3 of allowed 5 retrieving revision text for text id 346806445! Pausing 5 seconds before retry...
> >
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.9 GB (written)
> >
> > -----------------------------------------------
> >
> > 20100719 4:22:21am PST
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > Error 4 of allowed 5 retrieving revision text for text id 351921561! Pausing 5 seconds before retry...
> >
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.9 GB (written)
> >
> > -----------------------------------------------
> >
> > 20100719 4:37:21am PST
> >
> > # 2010-07-02 14:33:44 in-progress All pages with complete page edit history (.bz2)
> > Error 5 of allowed 5 retrieving revision text for text id 358280940! Pausing 5 seconds before retry...
> >
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2 119.9 GB (written)
> >
> > -----------------------------------------------
> >
> > 20100719 4:52:24am PST
> >
> > # 2010-07-19 11:37:24 failed All pages with complete page edit history (.bz2)
> > #6 {main}
> >
> > * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
> > * pages-meta-history.xml.bz2
> >
> > -----------------------------------------------
> >
> > pages referenced in the above errors:
> >
> > -----------------------------------------------
> >
> >
> > http://en.wikipedia.org/w/index.php?oldid=10595737
> > Brothers in Arms: Road to Hill 30
> > "This is an old revision of this page, as edited by Colonel Cow (talk | contribs) at 01:02, 17 February 2005."
> >
> > -----------------------------------------------
> >
> > http://en.wikipedia.org/w/index.php?oldid=13930238
> > Brothers in Arms: Road to Hill 30
> > "This is an old revision of this page, as edited by 213.212.58.66 (talk) at 12:34, 19 May 2005."
> >
> > -----------------------------------------------
> >
> > http://en.wikipedia.org/w/index.php?oldid=355313550
> > User:Peter I. Vardy/sandbox
> > This is an old revision of this page, as edited by Peter I. Vardy (talk | contribs) at 10:53, 11 April 2010.
> >
> > -----------------------------------------------
> >
> > http://en.wikipedia.org/w/index.php?oldid=346806445
> > Talk:Amy Shearn
> > "This is an old revision of this page, as edited by Yobot (talk | contribs) at 02:49, 28 February 2010."
> >
> > -----------------------------------------------
> >
> > http://en.wikipedia.org/w/index.php?oldid=351921561
> > User:Ohms Law Bot/Cleanup/Roy D. Bridges, Jr.
> > "This is an old revision of this page, as edited by Ohms Law Bot (talk | contribs) at 06:26, 25 March 2010."
> >
> > -----------------------------------------------
> >
> > http://en.wikipedia.org/w/index.php?oldid=358280940
> > The Tower Treasure
> > "This is an old revision of this page, as edited by 69.144.24.63 (talk) at 21:36, 25 April 2010."
> >
> > -----------------------------------------------
> >
> >
> >
> >
> > ----- Original Message -----
> > From: Dmitry Chichkov <dchichkov@gmail.com>
> > Date: Tuesday, July 20, 2010 3:31 pm
> > Subject: [Xmldatadumps-l] enwiki dump progress on 20100622 - failed again
> > To: xmldatadumps-l@lists.wikimedia.org
> >
> >
> >
> > > Subj: http://download.wikimedia.org/enwiki/20100622/
> > >
> > > Is there anything that can be done to alleviate that problem?
> > >
> > > By the way, what's the point of producing a .bz2 version of the
> > > pages-meta-history.xml dump? Is it easier on the system to produce .bz2
> > > first and .7z after that? From the user's perspective I can tell that
> > > .7z is all I need; there is simply no point in working with .bz2 (if
> > > .7z is available).
> > >
> > > -- Regards, Dmitry
> > >
> >
> > _______________________________________________
> > Xmldatadumps-l mailing list
> > Xmldatadumps-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
>
>