>> 20100719 4:37:21am PST
>> # 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
>> Error 5 of allowed 5 retrieving revision text for text id 358280940! Pausing 5 seconds before retry...

Well, my comment here would be that the number of allowed errors (5) and the retry delay (5 seconds) seem rather small. From that, it looks like 25 seconds of database unavailability would cause the backup to fail. Considering that the backup literally takes a month...
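A rough sketch of that arithmetic, for what it's worth. The 5/5 values come from the error line quoted above and the two timestamps come from the status lines on the dump page. The model itself (every retry fails back to back, with one fixed pause per error) is only an assumption about how the counter works, and the function name is made up for illustration.

```python
from datetime import datetime

# Values read off the log line:
# "Error 5 of allowed 5 ... Pausing 5 seconds before retry..."
ALLOWED_ERRORS = 5
RETRY_DELAY_SECONDS = 5


def tolerated_outage_seconds(allowed_errors: int, retry_delay: int) -> int:
    """Longest outage survived under the ASSUMED model: each error costs
    one fixed pause, and the job aborts once the allowance is used up."""
    return allowed_errors * retry_delay


outage = tolerated_outage_seconds(ALLOWED_ERRORS, RETRY_DELAY_SECONDS)

# Work lost at the point of failure, from the status page timestamps:
# started 2010-07-02 14:33:44, marked failed 2010-07-19 11:37:24.
started = datetime(2010, 7, 2, 14, 33, 44)
failed = datetime(2010, 7, 19, 11, 37, 24)
elapsed = failed - started

print(f"tolerated outage: {outage} s")
print(f"run time thrown away: {elapsed.days} days")
```

So under that model, well under half a minute of trouble is enough to discard more than two weeks of dumping.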

I'd suggest setting the allowed error count to something like 0.01% of the number of revisions. Also, an incomplete dump (e.g. with missing revision texts) is much better than nothing, so it would make sense to allow higher error rates or even make the interruption procedure manual.
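To make the suggestion concrete, here is a minimal sketch of an error budget that scales with the number of revisions. It is not the actual dump code: `BudgetedFetcher`, `ErrorBudgetExceeded` and the `fetch` callable are hypothetical names, and using the "max 371385750" figure from the progress line as the revision count is an assumption.

```python
import time
from typing import Callable, List, Optional


class ErrorBudgetExceeded(RuntimeError):
    """Raised once more revision texts are missing than the budget allows."""


class BudgetedFetcher:
    """Hypothetical wrapper: retry each text a few times, then skip it and
    keep going until the overall error budget is spent."""

    def __init__(self, fetch: Callable[[int], str], total_revisions: int,
                 max_error_rate: float = 0.0001,  # 0.01%, as suggested above
                 retries: int = 5, delay: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.fetch = fetch
        self.retries = retries
        self.delay = delay
        self.sleep = sleep
        self.budget = int(total_revisions * max_error_rate)
        self.missing: List[int] = []  # text ids that could not be fetched

    def get(self, text_id: int) -> Optional[str]:
        for attempt in range(1, self.retries + 1):
            try:
                return self.fetch(text_id)
            except IOError:
                if attempt < self.retries:
                    self.sleep(self.delay)
        # All retries failed: record the gap instead of aborting the dump.
        self.missing.append(text_id)
        if len(self.missing) > self.budget:
            raise ErrorBudgetExceeded(
                f"{len(self.missing)} missing texts, budget {self.budget}")
        return None


if __name__ == "__main__":
    def always_down(text_id: int) -> str:
        raise IOError("database unavailable")

    fetcher = BudgetedFetcher(always_down, total_revisions=371385750,
                              sleep=lambda seconds: None)
    print(fetcher.budget)          # revisions allowed to go missing
    print(fetcher.get(358280940))  # None: skipped, dump would continue
```

The list of missing ids could be written out next to the dump, so the gaps are at least known and could be refetched later.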

To put that 0.01% error rate into perspective, according to my estimates the error rate in the last "complete" database dump [enwiki-20100130 31.9GB/280GB] was at least ~0.4% (missing revision texts due to backup process failures).

-- Regards, Dmitry





On Wed, Jul 21, 2010 at 4:03 AM, Jamie Morken <jmorken@shaw.ca> wrote:

Hi,

I was polling the http://download.wikimedia.org/enwiki/20100622/ page during the pages-meta-history.xml.bz2 database dump, and here is some timestamped output from that page showing the errors that caused the dump to fail.  Regarding the .bz2 dump format, Tomasz earlier suggested removing it and using .7z.  I thought it might be good to keep the .bz2 format, since several programs use it (e.g. wikitaxi, bzreader).  The 7z format is probably the way to go in the future, but I don't know whether it would fix the database dump errors.

cheers,
Jamie


-----------------------------------------------

20100719 2:22:14am

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
2010-07-19 09:22:11: enwiki 889057 pages (0.613/sec), 110108000 revs (75.931/sec), 83.6% prefetched, ETA 2010-08-28 05:12:01 [max 371385750]

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.7 GB (written)



-----------------------------------------------

20100719 3:07:16am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
2010-07-19 10:07:15: enwiki 894194 pages (0.615/sec), 110399000 revs (75.990/sec), 83.6% prefetched, ETA 2010-08-28 04:08:46 [max 371385750]

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 3:22:17am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 2 of allowed 5 retrieving revision text for text id 10595737! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 3:37:18am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 3 of allowed 5 retrieving revision text for text id 13930238! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)


-----------------------------------------------

20100719 3:52:19am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 4 of allowed 5 retrieving revision text for text id 355313550! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:07:20am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 3 of allowed 5 retrieving revision text for text id 346806445! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:22:21am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 4 of allowed 5 retrieving revision text for text id 351921561! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:37:21am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 5 of allowed 5 retrieving revision text for text id 358280940! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:52:24am PST

# 2010-07-19 11:37:24 failed All pages with complete page edit history (.bz2)
#6 {main}

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2

-----------------------------------------------




pages referenced in the above errors:

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=10595737

Brothers in Arms: Road to Hill 30
"This is an old revision of this page, as edited by Colonel Cow (talk | contribs) at 01:02, 17 February 2005."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=13930238

Brothers in Arms: Road to Hill 30
"This is an old revision of this page, as edited by 213.212.58.66  (talk) at 12:34, 19 May 2005."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=355313550

User:Peter I. Vardy/sandbox
"This is an old revision of this page, as edited by Peter I. Vardy (talk | contribs) at 10:53, 11 April 2010."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=346806445

Talk:Amy Shearn
"This is an old revision of this page, as edited by Yobot (talk | contribs) at 02:49, 28 February 2010."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=351921561

User:Ohms Law Bot/Cleanup/Roy D. Bridges, Jr.
"This is an old revision of this page, as edited by Ohms Law Bot (talk | contribs) at 06:26, 25 March 2010."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=358280940

The Tower Treasure
"This is an old revision of this page, as edited by 69.144.24.63  (talk) at 21:36, 25 April 2010."

-----------------------------------------------




----- Original Message -----
From: Dmitry Chichkov <dchichkov@gmail.com>
Date: Tuesday, July 20, 2010 3:31 pm
Subject: [Xmldatadumps-l] enwiki dump progress on 20100622 - failed again
To: xmldatadumps-l@lists.wikimedia.org

> Subj: http://download.wikimedia.org/enwiki/20100622/
>
> Is there anything that can be done to alleviate that problem?
>
> By the way, what's the point of producing a .bz2 version of the
> pages-meta-history.xml dump? Is it easier on the system to produce
> .bz2 first and .7z after that? From the user's perspective I can tell
> that .7z is all I need; there is simply no point in working with .bz2
> (if .7z is available).
>
> -- Regards, Dmitry
>