Di (rut) wrote:
> Dear All, especially Anthony and Platonides,
> I'm not techy, so why hasn't it been possible to produce a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be person-power, and whether it would be interesting/useful for the university to help out with a programmer to get the dump happening.
See my blog posts discussing this matter:
http://leuksman.com/log/2007/10/02/wiki-data-dumps/
http://leuksman.com/log/2007/10/14/incremental-dumps/
http://leuksman.com/log/2007/10/29/wiki-dumps-in-dump-revision-diffs/
The general problem is that there's a lot of data and compressing it takes an ungodly amount of time. When it takes forever to run, you're more likely to hit some cute little error in the middle which causes the process to fail.
Either we need to make the process more resistant to problems or we need to speed it up a lot, or both.
Either splitting the dump into smaller pieces that can be checkpointed (Tim's suggestion) or making the grab-text-from-the-database subprocess recoverable (my suggestion) would allow a dump run broken by a lost database connection to continue to completion. (These are not mutually exclusive options.)
The cost of splitting the dump is complication for users -- more files to fetch, more difficulty for automation, possibly changes required to client scripts. On the other hand, smaller files are also popular with people who want to process the data in batches.
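To make the recoverable-subprocess idea concrete, here's a minimal sketch in Python of a checkpointed text pass. The function names, checkpoint file, and fetch interface are all hypothetical stand-ins, not the actual dump scripts:

  # Rough sketch of a checkpoint-and-resume text pass. Names and the
  # checkpoint format are hypothetical, not the real dump code.
  import json, os

  CHECKPOINT_FILE = "textpass.checkpoint"

  def load_checkpoint():
      # Last revision ID successfully written, or 0 for a fresh run.
      if os.path.exists(CHECKPOINT_FILE):
          with open(CHECKPOINT_FILE) as f:
              return json.load(f)["last_rev_id"]
      return 0

  def save_checkpoint(rev_id):
      with open(CHECKPOINT_FILE, "w") as f:
          json.dump({"last_rev_id": rev_id}, f)

  def run_text_pass(fetch_revisions, write_revision, batch_size=1000):
      # fetch_revisions(after_id, limit) -> list of (rev_id, text) pairs,
      # assumed to raise ConnectionError when the DB connection drops.
      last_id = load_checkpoint()
      while True:
          try:
              batch = fetch_revisions(after_id=last_id, limit=batch_size)
          except ConnectionError:
              continue  # reconnect and retry from the checkpoint, don't abort
          if not batch:
              break  # all revisions written
          for rev_id, text in batch:
              write_revision(rev_id, text)
              last_id = rev_id
          save_checkpoint(last_id)

The point is just that a dropped connection costs you at most one batch, rather than the whole multi-day run.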
Replacing thousands-of-revisions-bzipped-or-7zipped-together with a smarter diff-based format would reduce the amount of slow general-purpose compression needed to reach a decent download size. That should also cut the run time, making it more likely that a history dump will finish without hitting an error.
This would involve changing the format, necessitating even more changes to client software for compatibility.
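As a toy illustration of why diffs help (this is not the proposed dump format, just the underlying idea): consecutive revisions of an article usually differ by only a few lines, so storing the first revision in full and then only line diffs shrinks the data drastically before any general-purpose compressor ever runs.

  # Toy illustration only: store the first revision in full, then a unified
  # diff from each revision to the next instead of every full text.
  import difflib

  def revisions_to_diffs(revisions):
      # revisions: full revision texts, oldest first.
      stored = [("full", revisions[0])]
      for prev, curr in zip(revisions, revisions[1:]):
          diff = "".join(difflib.unified_diff(
              prev.splitlines(keepends=True),
              curr.splitlines(keepends=True)))
          stored.append(("diff", diff))
      return stored

  revs = ["Hello world.\nFirst draft.\n",
          "Hello world.\nSecond draft.\n",
          "Hello world!\nSecond draft.\nAnother paragraph.\n"]
  for kind, payload in revisions_to_diffs(revs):
      print(kind, len(payload), "bytes")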
Alas, this hasn't yet gotten all the work it needs. We currently have a programming staff of two (me and Tim) jumping back and forth between too many projects and our own relocations, and neither of us has gotten this project to the finish line yet. Neither has any other interested party so far.
(Note that the foundation will be hiring a couple more programmers for 2008, as we get the San Francisco office set up.)
> Also, I now have a file from 2006, but I still wonder whether there is any place where one can access old dumps. These could be very important research-wise.
I have a fair number of *old* dumps sitting around at the office, but I'm not sure if I have any medium-depth ones. We don't generally keep old dumps up for download, but I could possibly provide an individual one if needed for research purposes.
> And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in some other fashion. I make my plea again: do you know who put in the block so that export only allows 100 revisions? Is there any way to hack around that? Would it be possible to make an exception to get the data for a research study?
That was originally done because buffering would cause a longer export to fail. The export code has since been changed so that it should skip buffering, so the limit could possibly be lifted. I'll take a peek.
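In the meantime, one workaround is to page through a title's history in chunks via Special:Export. A hedged sketch follows; the pages/offset/limit/action parameters are my assumption about how the export interface is driven (with offset being a revision timestamp and 1 meaning "start from the oldest"), so verify against the wiki before building on it:

  # Hedged sketch of chunked history export via Special:Export. The parameter
  # names (pages, offset, limit, action) are assumptions; check them first.
  import urllib.parse
  import urllib.request

  EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"

  def fetch_history_chunk(title, offset="1", limit=100):
      # offset is a revision timestamp; "1" means start from the oldest.
      params = urllib.parse.urlencode({
          "pages": title,
          "offset": offset,
          "limit": str(limit),
          "action": "submit",
      }).encode()
      req = urllib.request.Request(EXPORT_URL, data=params,
                                   headers={"User-Agent": "history-fetch-sketch/0.1"})
      with urllib.request.urlopen(req) as resp:
          return resp.read().decode("utf-8")  # XML with up to `limit` revisions

  # To walk a full history: parse the last <timestamp> out of each chunk and
  # feed it back as the next offset until no new revisions come back.
  print(fetch_history_chunk("Main Page")[:200])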
-- brion vibber (brion @ wikimedia.org)