Enwiki dump failing (twice in a row) - Wikitech-l - lists.wikimedia.org

Enwiki dump failing (twice in a row)

John Grohol

3 Nov 2006

10:57 a.m.

The past two dumps of enwiki have failed; last successful dump was in September. Any ideas on what might be causing this?

Tim Starling

3 Nov 2006

7:10 p.m.

John Grohol wrote:

The past two dumps of enwiki have failed; last successful dump was in September. Any ideas on what might be causing this?

I don't think there's any mystery. It's not restartable, and there's all sorts of things that can make the process die. A simple solution would be to split the dump by page_id, say into 10 files. Then if the process died, you could restart from the start of that file, rather than the start of the wiki. If it's too big a hassle for clients to download 10 files instead of 1, then you can always put them in a separate directory and serve them by anonymous FTP. -- Tim Starling
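The split-by-page_id idea above can be sketched as follows. This is only an illustration under stated assumptions, not how dumpBackup.php works: `page_id_ranges`, `dump_in_chunks`, the `chunk-NN.xml` file names, and the `dump_range` callback are all hypothetical names. Each range is written to its own file, and a finished file is skipped on the next run, so a crash only costs the chunk that was in progress.

```python
import os

def page_id_ranges(max_page_id, num_chunks):
    """Split page ids 1..max_page_id into at most num_chunks inclusive ranges."""
    size = -(-max_page_id // num_chunks)  # ceiling division
    return [(start, min(start + size - 1, max_page_id))
            for start in range(1, max_page_id + 1, size)]

def dump_in_chunks(max_page_id, num_chunks, out_dir, dump_range):
    """Write one file per page_id range, skipping ranges that already finished.

    dump_range(start, end, out) is a hypothetical callback that writes the
    pages in [start, end] to the open file object `out`; it may raise.
    Returns the list of files completed during this call.
    """
    written = []
    for index, (start, end) in enumerate(page_id_ranges(max_page_id, num_chunks)):
        final_path = os.path.join(out_dir, f"chunk-{index:02d}.xml")
        if os.path.exists(final_path):
            continue  # completed in an earlier run
        partial_path = final_path + ".partial"
        with open(partial_path, "w", encoding="utf-8") as out:
            dump_range(start, end, out)  # on failure the partial file is redone later
        os.replace(partial_path, final_path)  # rename marks the chunk as complete
        written.append(final_path)
    return written
```

Re-running `dump_in_chunks` with the same arguments after a crash resumes at the first range that has no finished file.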

Gregory Maxwell

7:35 p.m.

On 11/3/06, Tim Starling <tstarling(a)wikimedia.org> wrote:

John Grohol wrote:

The past two dumps of enwiki have failed; last successful dump was in September. Any ideas on what might be causing this?

I don't think there's any mystery. It's not restartable, and there's all sorts of things that can make the process die. A simple solution would be to split the dump by page_id, say into 10 files. Then if the process died, you could restart from the start of that file, rather than the start of the wiki. If it's too big a hassle for clients to download 10 files instead of 1, then you can always put them in a separate directory and serve them by anonymous FTP.

Whee, even more non-atomic. In any case, if we could complete a dump in ten files, there is absolutely no reason we couldn't concatenate the results. What are you smoking? :)
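The concatenation point can be illustrated with a short sketch. It assumes, purely for illustration, that each chunk file is uncompressed and holds only page elements, so that a shared header and footer can be written once around them; `concatenate_chunks` and its parameters are hypothetical names. The combined file is built under a temporary name and renamed at the end, so readers never see a half-written result.

```python
import os
import shutil

def concatenate_chunks(chunk_paths, combined_path, header=b"", footer=b""):
    """Join chunk files, in the order given, into one file at combined_path."""
    partial_path = combined_path + ".partial"
    with open(partial_path, "wb") as combined:
        combined.write(header)
        for path in chunk_paths:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, combined)  # stream; do not load into memory
        combined.write(footer)
    os.replace(partial_path, combined_path)  # publish the finished file in one step
```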

Nick Jenkins

8:09 p.m.

John Grohol wrote:

The past two dumps of enwiki have failed; last successful dump was in September. Any ideas on what might be causing this?

I don't think there's any mystery. It's not restartable, and there's all sorts of things that can make the process die. A simple solution would be to split the dump by page_id, say into 10 files. Then if the process died, you could restart from the start of that file, rather than the start of the wiki. If it's too big a hassle for clients to download 10 files instead of 1, then you can always put them in a separate directory and serve them by anonymous FTP. -- Tim Starling

Wasn't there talk about something snazzy being added so that if the database connection went away or a query failed whilst creating the dump, it would catch an exception, sleep a bit, and keep retrying that operation until the connection was re-established or query succeeded? All the best, Nick.

Brion Vibber

4 Nov 2006

6:48 a.m.

Nick Jenkins wrote:

John Grohol wrote:

The past two dumps of enwiki have failed; last successful dump was in September. Any ideas on what might be causing this?

I don't think there's any mystery. It's not restartable, and there's all sorts of things that can make the process die. A simple solution would be to split the dump by page_id, say into 10 files. Then if the process died, you could restart from the start of that file, rather than the start of the wiki. If it's too big a hassle for clients to download 10 files instead of 1, then you can always put them in a separate directory and serve them by anonymous FTP. -- Tim Starling

Wasn't there talk about something snazzy being added so that if the database connection went away or a query failed whilst creating the dump, it would catch an exception, sleep a bit, and keep retrying that operation until the connection was re-established or query succeeded?

Yes, and that does in fact work -- if you sit and watch the dump status you'll sometimes see the message that it disconnected and is waiting to retry. This last dump, though, had an unknown problem in the XML skeleton dump which didn't produce any output error message. The skeleton dump is usually very reliable, as it doesn't have to sit there begging external storage servers for data. I'll have to take a peek at it... -- brion vibber (brion @ pobox.com)
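The catch, sleep, and retry behaviour discussed above can be sketched generically. This is not the MediaWiki code; `retry_until_success` and its parameters are hypothetical, and the choice of `ConnectionError` as the retriable error is an assumption made for the example. Any other exception propagates immediately.

```python
import time

def retry_until_success(operation, retriable=(ConnectionError,),
                        delay_seconds=5.0, sleep=time.sleep, on_retry=None):
    """Call operation() until it returns without raising a retriable error.

    on_retry(attempt, error), if given, is called before each sleep, for
    example to print a "disconnected, waiting to retry" status message.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except retriable as error:
            attempt += 1
            if on_retry is not None:
                on_retry(attempt, error)
            sleep(delay_seconds)  # wait a bit, then try the same operation again
```

The loop retries forever by design, matching the "keep retrying until the connection is re-established" description; a production version would probably want an upper bound or an alert.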

Brion Vibber

7:41 a.m.

Brion Vibber wrote:

This last dump, though, had an unknown problem in the XML skeleton dump which didn't produce any output error message. The skeleton dump is usually very reliable, as it doesn't have to sit there begging external storage servers for data. I'll have to take a peek at it...

A few weeks ago, the mode constants for the WikiExporter class were renamed and not all uses were updated. One of the uses was in backup.inc, used by dumpBackup.php which generates these dumps.

Since the old name of the constant was no longer valid, the exporter object wasn't informed that the dump wanted to use unbuffered queries. Since the database connection wasn't switched to unbuffered mode, the *entire contents of the page and revision tables* were buffered into memory before the XML skeleton output could be produced. That works on small wikis, but on the biggest ones this is too much data to fit in memory, leading to the process being killed by the operating system.

I've fixed backup.inc to refer to the new name of the constant in r17405. enwiki dump is restarted now; dewiki should get around later in its dump group cycle. Sighhs :)

A general note: I've been planning to replace dumpTextPass.php with a more reliable manager program (perhaps Java, to reuse existing mwdumper code) which interfaces with a minimal PHP script that just loads text records out of the database. If the PHP crashes out, the manager program can just restart it at will.

[The PHP layer is needed because our text storage is hideously complicated, with ever-shifting database pools, legacy encodings, compression, batch compression, etc. It's easier to leave that logic in one place in MediaWiki rather than try to reproduce it and keep that in sync in a utility program.]

I haven't quite got around to writing this yet. If someone *really* wants to do it in the next week or two they'd be my friend, otherwise it's on my todo list... :)

In short it needs to:

a: read in skeleton XML dump [all page/revision data, no text]
b: read in previous full XML dump [which has text!]
c: talk to MediaWiki script to pull text revisions not found in previous full XML dump, restarting if it dies
d: output full XML including text

Shouldn't be *that* hard.

-- brion vibber (brion @ pobox.com)
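Steps a to d in the message above can be sketched roughly as below. This is a toy model under loud assumptions, not the planned manager program: the skeleton is reduced to a sequence of revision ids, the previous full dump to a dict of revision id to text, and the MediaWiki helper script to a hypothetical `fetch_text` callable that raises `RuntimeError` when the helper dies. All names are invented for the example.

```python
def fill_in_text(skeleton_rev_ids, previous_text, fetch_text, max_restarts=3):
    """Yield (rev_id, text) for every revision in the skeleton, in order.

    skeleton_rev_ids: revision ids from the skeleton dump (step a).
    previous_text:    rev_id -> text taken from the previous full dump (step b).
    fetch_text:       hypothetical callable that asks the helper for one
                      revision's text and raises RuntimeError if it dies (step c).
    """
    for rev_id in skeleton_rev_ids:
        if rev_id in previous_text:
            yield rev_id, previous_text[rev_id]  # reuse text already dumped
            continue
        for attempt in range(max_restarts + 1):
            try:
                text = fetch_text(rev_id)
                break
            except RuntimeError:
                if attempt == max_restarts:
                    raise  # give up after too many failed restarts
                # A real manager would restart the helper process here.
        yield rev_id, text  # step d: emit the revision with its text
```

Because the function is a generator, output can be written as it is produced rather than held in memory, which matters at enwiki scale.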
