Brion Vibber wrote:
This last dump, though, had an unknown problem in the XML skeleton dump
which didn't produce any output error message.
The skeleton dump is usually very reliable, as it doesn't have to sit
there begging external storage servers for data. I'll have to take a
peek at it...
A few weeks ago, the mode constants for the WikiExporter class were
renamed and not all uses were updated. One of the uses was in
backup.inc, used by dumpBackup.php which generates these dumps.
Since the old name of the constant was no longer valid, the exporter
object wasn't informed that the dump wanted to use unbuffered queries.
Since the database connection wasn't switched to unbuffered mode, the
*entire contents of the page and revision tables* were buffered into
memory before the XML skeleton output could be produced.
That works on small wikis, but on the biggest ones this is too much data
to fit in memory, leading to the process being killed by the operating
system.
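To illustrate the difference between the two query modes, here is a minimal sketch in Python using the standard sqlite3 module as a stand-in. It is not the MediaWiki code; the table layout, function names and XML shape are made up for the example. The buffered variant pulls every row into memory before writing anything, while the streaming variant hands rows on as the cursor yields them.

```python
import sqlite3

def make_db(n_rows):
    """Build a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)"
    )
    conn.executemany(
        "INSERT INTO revision (rev_id, rev_page) VALUES (?, ?)",
        ((i, i % 10) for i in range(1, n_rows + 1)),
    )
    return conn

QUERY = "SELECT rev_id, rev_page FROM revision ORDER BY rev_id"

def dump_buffered(conn):
    # fetchall() materializes the whole result set before any output is
    # produced; memory use grows with the size of the table.
    rows = conn.execute(QUERY).fetchall()
    return ['<revision id="%d" page="%d"/>' % (r, p) for r, p in rows]

def dump_streaming(conn):
    # Iterating the cursor fetches rows as they are consumed, so only a
    # small window of the result is held in memory at a time.
    for r, p in conn.execute(QUERY):
        yield '<revision id="%d" page="%d"/>' % (r, p)
```

Both produce the same output; only the peak memory differs, which is why the problem goes unnoticed on small wikis.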
I've fixed backup.inc to refer to the new name of the constant in r17405.
The enwiki dump has been restarted; dewiki should come around later in
its dump group cycle. Sigh :)
A general note: I've been planning to replace dumpTextPass.php with a
more reliable manager program (perhaps Java, to reuse existing mwdumper
code) which interfaces with a minimal PHP script that just loads text
records out of the database. If the PHP crashes out, the manager program
can just restart it at will.
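The restart behaviour could look roughly like the sketch below. It is written in Python rather than Java purely for brevity, and run_with_restart, the attempt limit and the command line are all hypothetical; a real manager would also need to resume from the last revision it successfully received.

```python
import subprocess
import sys

def run_with_restart(cmd, max_attempts=5):
    """Run cmd until it exits with status 0.

    Returns the number of attempts used, or raises RuntimeError once
    max_attempts runs have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        # The worker died (crash, kill, error exit); report and go again.
        print(
            "worker exited with status %d, restarting" % result.returncode,
            file=sys.stderr,
        )
    raise RuntimeError("worker failed %d times: %r" % (max_attempts, cmd))
```

The cap on attempts is a design choice for the sketch: without it, a worker that fails deterministically would be restarted forever.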
[The PHP layer is needed because our text storage is hideously
complicated, with ever-shifting database pools, legacy encodings,
compression, batch compression, etc. It's easier to leave that logic in
one place in MediaWiki rather than try to reproduce it and keep that in
sync in a utility program.]
I haven't quite got around to writing this yet. If someone *really*
wants to do it in the next week or two they'd be my friend, otherwise
it's on my todo list... :)
In short it needs to:
a: read in skeleton XML dump [all page/revision data, no text]
b: read in previous full XML dump [which has text!]
c: talk to MediaWiki script to pull text revisions not found in previous
full XML dump, restarting if it dies
d: output full XML including text
Shouldn't be *that* hard.
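Steps a through d can be sketched as follows, assuming the skeleton has already been parsed into (page id, revision id) pairs and the previous dump into a revision-id-to-text mapping. The names fill_text, previous_text and fetch_text are hypothetical, and XML parsing and serialization are left out entirely.

```python
def fill_text(skeleton, previous_text, fetch_text):
    """Yield (page_id, rev_id, text) for every revision in the skeleton.

    skeleton:      iterable of (page_id, rev_id) pairs        (step a)
    previous_text: dict mapping rev_id to text, built from
                   the previous full dump                     (step b)
    fetch_text:    callable taking a rev_id and returning its
                   text, e.g. by asking the PHP script        (step c)
    """
    for page_id, rev_id in skeleton:
        text = previous_text.get(rev_id)
        if text is None:
            # Not in the previous dump, so ask MediaWiki for it.
            text = fetch_text(rev_id)
        yield page_id, rev_id, text  # step d: emit with text attached


# Usage with toy data: only revision 3 is missing from the old dump.
skeleton = [(1, 1), (1, 2), (2, 3)]
previous_text = {1: "first", 2: "second"}
requested = []

def fake_fetch(rev_id):
    requested.append(rev_id)
    return "fetched %d" % rev_id

result = list(fill_text(skeleton, previous_text, fake_fetch))
```

Because it is a generator, the output can be written as it goes without holding the whole dump in memory. A real previous dump would be far too large for a dict, so an actual implementation would presumably read both dumps in parallel in the same order.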
-- brion vibber (brion @ pobox.com)