[Foundation-l] dumps

Anthony wikimail at inbox.org
Tue Feb 24 18:59:37 UTC 2009


On Tue, Feb 24, 2009 at 1:24 PM, Brian <Brian.Mingus at colorado.edu> wrote:

> I am of the understanding that the WMF's bandwidth is very cheap.
>

Compared to what?

> If you want to consider costs, I think its appropriate to consider the
> costs not only to the WMF but to the user. Different compression
> algorithms have different encode/decode ratios but if it takes a
> cluster to compress a file there's a good chance you're going to want
> one to decompress it.


bzip2 decompression speeds on an average CPU almost certainly exceed
Internet download speeds.
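
If you'd rather measure than take my word for it, here is a rough sketch
in Python using the standard bz2 module. The function name and the sample
payload are made up for illustration, and repetitive sample text will not
behave like a real dump, so treat the result as a ballpark only. It
reports compressed bytes consumed per second, which is the figure to
compare against your download rate.

```python
import bz2
import time


def bzip2_decompress_rate(payload: bytes) -> float:
    """Return compressed bytes consumed per second while decompressing.

    Hypothetical helper: compresses `payload` in memory, then times a
    single decompression pass over the result.
    """
    compressed = bz2.compress(payload)
    start = time.perf_counter()
    restored = bz2.decompress(compressed)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide-by-zero
    if restored != payload:
        raise ValueError("round trip failed")
    return len(compressed) / elapsed


if __name__ == "__main__":
    # Stand-in data; substitute a slice of a real dump for a fairer test.
    sample = b"An example line of wiki text, repeated many times.\n" * 200000
    rate = bzip2_decompress_rate(sample)
    print(f"{rate / 1e6:.2f} MB/s of compressed input")
```

Compare the printed figure with whatever your connection actually
delivers.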


> It may in fact be much more user friendly to
> simply offer an enormous text file for download because users don't
> have to unpack it.


I've looked at the numbers and thought about this in detail and I don't
think so.  What definitely *would* be much more user friendly is to use a
compression scheme which allows random access, so that end users don't have
to decompress everything all at once in the first place.
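
To make that concrete, here is a minimal sketch of one way to get random
access out of plain bzip2: compress the records in small independent
streams, and keep an index of where each stream starts. This is only an
illustration, not how the dumps are produced; the function names, the
block size, and the one-record-per-line layout are all my assumptions.

```python
import bz2
import io


def write_blocks(records, out, per_block=2):
    """Write records as independent bz2 streams, one per block.

    Returns an index: a list of (byte_offset, first_record_number).
    Assumes each record is a string with no embedded newline.
    """
    index = []
    for start in range(0, len(records), per_block):
        index.append((out.tell(), start))
        chunk = "".join(r + "\n" for r in records[start:start + per_block])
        out.write(bz2.compress(chunk.encode("utf-8")))
    return index


def read_record(f, index, n, per_block=2):
    """Fetch record n by decompressing only the block that holds it."""
    block = n // per_block
    offset, first = index[block]
    f.seek(offset)
    if block + 1 < len(index):
        data = f.read(index[block + 1][0] - offset)
    else:
        data = f.read()  # last block runs to end of file
    lines = bz2.decompress(data).decode("utf-8").split("\n")
    return lines[n - first]


if __name__ == "__main__":
    records = ["rev 0", "rev 1", "rev 2", "rev 3", "rev 4"]
    buf = io.BytesIO()
    index = write_blocks(records, buf)
    print(read_record(buf, index, 3))
```

The concatenated streams still form a valid bzip2 file, so anyone who
wants the whole thing can decompress it the usual way. The trade-off is
a somewhat worse compression ratio, since each block is compressed on
its own.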

The uncompressed full-history English Wikipedia dump is approaching (and
more likely has already exceeded) the size that will fit on the largest
consumer hard drives.  So just dealing with such a large file is a problem
in itself.  And "an enormous text file" is not very useful without an
index, so you've gotta import the thing into some sort of database anyway,
which, unless you're a database guru, is going to take longer than a simple
decompression.

In the long term (and considering how long it's taking to just produce a
usable dump, the long term may never come), the most user-friendly dump
would already be compressed, indexed, and ready for random access, so a
reuser could just download and go (or even download on the fly as needed).
It could be done, but I make no bet on whether or not it will be done.

> Our mission is to spread knowledge. Compressing that knowledge has
> been in the way of spreading it for years now. Its high time we gave
> up!


Clearly something is in the way.  I don't think it's the compression,
though.


