On Mon, Feb 23, 2009 at 11:08 AM, Alex <mrzmanwiki@gmail.com> wrote:
> Most of that hasn't been touched in years, and it seems to be mainly a Python wrapper around the dump scripts in /phase3/maintenance/ which also don't seem to have had significant changes recently. Has anything been done recently (in a very broad sense of the word)? Or at least, has anything been written down about what the plans are?
In a "very broad sense" (and not directly connected to main problems), I wrote a compressor [1] that converts full-text history dumps into an "edit syntax" that provides ~95% compression on the larger dumps while keeping it in a plain text format that could still be searched and processed without needing a full decompression.
That's one of several ways the dump process could be modified to make its output easier to work with (if it takes ~2 TB to expand enwiki's full history, the result is impractical for most users even if we solve the problem of generating it). My particular approach is not necessarily the right answer, but changes in formatting that aid distribution, generation, and use are one of the areas that ought to be considered when reimplementing the dump process.
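
To give a concrete feel for the general idea, here is a small illustrative Python sketch using difflib. It is not the actual editsyntax format (the function name and example revisions are made up); it just shows how storing each revision as a plain-text diff against its predecessor strips out most of the redundancy while the output stays greppable text:

import difflib

def revisions_to_diffs(revisions):
    """Yield the first revision verbatim, then each later revision as a
    plain-text unified diff against its predecessor."""
    previous = []
    for i, text in enumerate(revisions):
        current = text.splitlines(True)
        if i == 0:
            yield text                                  # full text once
        else:
            diff = difflib.unified_diff(previous, current,
                                        'rev%d' % (i - 1), 'rev%d' % i)
            yield ''.join(diff)                         # diffs thereafter
        previous = current

# Example: three revisions of one page; most lines never change, so the
# later "revisions" shrink to a few diff lines each.
revs = ["Heading\n\nSome article text.\n",
        "Heading\n\nSome article text, slightly expanded.\n",
        "Heading\n\nSome article text, slightly expanded.\n\nA new section.\n"]
for chunk in revisions_to_diffs(revs):
    print(chunk)

The actual format differs from this, but the tradeoff is the same: a little decode work in exchange for output that is a small fraction of the expanded size and can still be searched as text.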
The largest gains, though, are almost certainly going to come from parallelization. A single monolithic dumper is impractical for enwiki.
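
As a very rough sketch of what I mean (purely hypothetical; the helper name dump_page_range, the chunk size, and the file naming are invented for illustration), the work could be split by page-id range and farmed out to worker processes, each writing an independent output file:

from multiprocessing import Pool

def dump_page_range(bounds):
    start_id, end_id = bounds
    outfile = 'pages-%09d-%09d.xml' % (start_id, end_id)
    # ... fetch pages with start_id <= page_id < end_id and write them
    # to outfile (omitted in this sketch) ...
    return outfile

def parallel_dump(max_page_id, chunk=100000, workers=8):
    ranges = [(lo, min(lo + chunk, max_page_id))
              for lo in range(0, max_page_id, chunk)]
    pool = Pool(workers)
    try:
        # Each range becomes its own file, so a failed chunk can be
        # rerun by itself instead of restarting the whole dump.
        return pool.map(dump_page_range, ranges)
    finally:
        pool.close()
        pool.join()

The per-range files can then be recombined or distributed as-is, and a crashed or corrupt chunk can be regenerated without touching the rest of the dump.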
-Robert Rohde
[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/editsyntax/