I see that the latest dump of the English Wikipedia (the full page-history
dump, with all revisions) failed.
As part of some other work I am doing, I have efficient code that can "take
apart" a dump into its single component pages, and out of that, it would be
possible to fashion code that "stitches together" various partial dumps.
This would make it possible to break up a single dump into multiple, shorter
processes, each of which dumps, for example, only one month's or one week's
worth of revisions to the English Wikipedia.
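To illustrate the idea, here is a minimal sketch of a time-windowed partial dump. It assumes a simplified page/revision XML layout (not the full MediaWiki export schema) and relies on the fact that ISO 8601 timestamps compare chronologically as strings; the function name and schema are hypothetical.

```python
import xml.etree.ElementTree as ET
from io import StringIO

def partial_dump(dump_file, start, end):
    """Extract only the revisions whose timestamp falls in [start, end).

    dump_file: a file-like object with a simplified dump layout:
      <mediawiki><page><id>..</id><revision><id>..</id>
      <timestamp>..</timestamp><text>..</text></revision>...</page></mediawiki>
    start, end: ISO 8601 timestamp strings delimiting the window.
    """
    out = []
    for _, elem in ET.iterparse(dump_file, events=("end",)):
        if elem.tag != "page":
            continue
        page_id = elem.findtext("id")
        for rev in elem.findall("revision"):
            ts = rev.findtext("timestamp")
            # ISO 8601 strings sort chronologically, so plain string
            # comparison implements the time window.
            if start <= ts < end:
                out.append((page_id, rev.findtext("id"), ts))
        elem.clear()  # free memory as we stream through the dump
    return out
```

Streaming with `iterparse` and clearing each page after use keeps memory bounded even on very large dumps, which is the point of the exercise.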
Breaking up the dump process will increase the probability that each of the
smaller dumps succeeds.
For instance, one could produce all the partial dumps, then launch the
stitching process, which merges them into a single dump, removing duplicate
revisions.
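A stitching pass along these lines could be sketched as follows. This is an assumption about how such a tool might work, not existing code: it merges several partial dumps and drops revisions already seen (by revision id), again using a simplified hypothetical schema rather than the real MediaWiki export format.

```python
import xml.etree.ElementTree as ET
from io import StringIO

def stitch_dumps(dump_files):
    """Merge several partial dumps into one list of (page_id, rev_id, text),
    dropping duplicate revisions that appear in overlapping partial dumps.

    dump_files: iterable of file-like objects, each a simplified dump:
      <mediawiki><page><id>..</id><revision><id>..</id>
      <text>..</text></revision>...</page></mediawiki>
    """
    seen = set()    # revision ids already emitted
    merged = []
    for f in dump_files:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag != "page":
                continue
            page_id = elem.findtext("id")
            for rev in elem.findall("revision"):
                rev_id = rev.findtext("id")
                if rev_id in seen:
                    continue  # duplicate from an overlapping partial dump
                seen.add(rev_id)
                merged.append((page_id, rev_id, rev.findtext("text")))
            elem.clear()
    return merged
```

A real tool would also have to re-group revisions by page and write the result back out in the proper export format, but the deduplication step is the core of the stitching idea.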
At UCSC, where I work, there are various Master's students looking for
projects... and some may be interested in doing work that is concretely
useful to Wikipedia. Should I try to get them interested in writing a
proper dump-stitching tool, plus some code to do partial dumps?
Can Brion or Tim give us more detail on why the dumps are failing? Are they
already doing partial dumps? Is there already a dump stitching tool? Is
there anything that could be done to help the process? I could help by
looking for database students in search of a project and giving them my code
as a starting point...
Best,
Luca