[Foundation-l] dumps

Ángel keisial at gmail.com
Tue Feb 24 22:34:57 UTC 2009


Anthony wrote:
> I've looked at the numbers and thought about this in detail and I don't
> think so.  What definitely *would* be much more user friendly is to use a
> compression scheme which allows random access, so that end users don't have
> to decompress everything all at once in the first place.
> 
> The uncompressed full history English Wikipedia dump is reaching (and more
> likely has already exceeded) the size which will fit on the largest consumer
> hard drives.  So just dealing with such a large file is a problem in
> itself.  And "an enormous text file" is not very useful without an index, so
> you've gotta import the thing into some sort of database anyway, which,
> unless you're a database guru, is going to take longer than a simple
> decompression.
> 
> In the long term, and considering how long it's taking to just produce a
> usable dump, the long term may never come, the most user-friendly dump would
> already be compressed, indexed, and ready for random access, so a reuser
> could just download and go (or even download on the fly as needed).  It
> could be done, but I make no bet on whether or not it will be done.

I did make indexed, random-access, backwards-compatible XML dumps:
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html

It wouldn't be hard to plug into the dump process (just replace bzip2 in a
new DumpPipeOutput), but so far nobody has seemed interested in it.
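
To give an idea of the technique (only a rough Python sketch of the general
block-plus-index approach, not the actual code behind the link above; the
block size, the index format and the function names are all made up here):

import bz2

BLOCK_SIZE = 1 << 20  # ~1 MiB of uncompressed XML per block (arbitrary)

def write_indexed_dump(pages, dump_path, index_path):
    """pages: iterable of (title, page_xml) pairs."""
    offset = 0            # byte offset of the current block inside the dump
    buf, titles = [], []
    with open(dump_path, "wb") as dump, \
         open(index_path, "w", encoding="utf-8") as index:

        def flush():
            nonlocal offset
            if not buf:
                return
            # Each block is a complete, independent bzip2 stream, so it can
            # be decompressed later without touching anything before it.
            block = bz2.compress("".join(buf).encode("utf-8"))
            for t in titles:
                index.write("%d\t%s\n" % (offset, t))
            dump.write(block)
            offset += len(block)
            buf.clear()
            titles.clear()

        for title, page_xml in pages:
            buf.append(page_xml)
            titles.append(title)
            if sum(len(s) for s in buf) >= BLOCK_SIZE:
                flush()
        flush()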

And there's the added benefit of the offline reader I implemented using
those files.
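
The reader only needs the index: seek to the offset of the block that holds
the page you want and decompress that single block, instead of the whole
dump. Again a sketch against the toy format above, not the actual reader:

import bz2

def load_index(index_path):
    """Map each title to the offset of the bzip2 block that contains it."""
    index = {}
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            offset, title = line.rstrip("\n").split("\t", 1)
            index[title] = int(offset)
    return index

def read_block(dump_path, index, title):
    """Decompress only the one block holding the requested page; picking
    the right <page> element out of the returned text is up to the caller."""
    decomp = bz2.BZ2Decompressor()
    data = b""
    with open(dump_path, "rb") as f:
        f.seek(index[title])
        while not decomp.eof:   # stops at the end of this block's stream
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            data += decomp.decompress(chunk)
    return data.decode("utf-8")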
