Anthony wrote:
I've looked at the numbers and thought about this in detail, and I don't
think so. What definitely *would* be much more user-friendly is a
compression scheme that allows random access, so that end users don't have
to decompress everything all at once in the first place.
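
(For illustration, a minimal Python sketch of the idea -- not any actual
dump format; the layout, names, and pages-per-stream figure are all made
up. The dump is written as many independent bzip2 streams, with a side
index recording where each stream starts, so a reader can later decompress
one stream in isolation.)

import bz2

def write_indexed(pages, dump_path, index_path, pages_per_stream=100):
    # Each group of pages becomes its own bzip2 stream; independent
    # streams are what make random access possible later.
    offsets = []
    with open(dump_path, "wb") as out:
        for i in range(0, len(pages), pages_per_stream):
            offsets.append(out.tell())  # byte offset of this stream
            chunk = "".join(pages[i:i + pages_per_stream])
            out.write(bz2.compress(chunk.encode("utf-8")))
    # The index is just one stream offset per line.
    with open(index_path, "w") as idx:
        idx.write("\n".join(str(o) for o in offsets))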
The uncompressed full-history English Wikipedia dump is reaching (and has
more likely already exceeded) the size that will fit on the largest
consumer hard drives, so just dealing with such a large file is a problem
in itself. And "an enormous text file" is not very useful without an
index, so you have to import the thing into some sort of database anyway,
which, unless you're a database guru, is going to take longer than a
simple decompression.
In the long term (and considering how long it's taking just to produce a
usable dump, the long term may never come), the most user-friendly dump
would already be compressed, indexed, and ready for random access, so a
reuser could just download and go (or even download on the fly as needed).
It could be done, but I make no bet on whether or not it will be done.
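
(And a matching sketch of the reader side, under the same made-up layout;
a real "download and go" dump would also need a title-to-stream map, which
is omitted here. Seeking to a recorded offset and running a fresh
decompressor touches only the one stream that holds the wanted pages.)

import bz2

def read_stream(dump_path, index_path, n):
    # Decompress only the n-th stream; a fresh BZ2Decompressor stops
    # by itself at the end of a single bzip2 stream.
    with open(index_path) as idx:
        offsets = [int(line) for line in idx]
    out = []
    with open(dump_path, "rb") as f:
        f.seek(offsets[n])
        d = bz2.BZ2Decompressor()
        while not d.eof:
            block = f.read(64 * 1024)
            if not block:
                break
            out.append(d.decompress(block))
    return b"".join(out).decode("utf-8")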
I did make indexed, random-access, backwards-compatible XML dumps:
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html
It wouldn't be hard to plug into the dump process (just replace bzip2 with
a new DumpPipeOutput), but so far nobody has seemed interested in it.
And there's the added benefit of the offline reader I implemented using
those files.