[Foundation-l] dumps

Robert Rohde rarohde at gmail.com
Tue Feb 24 23:51:41 UTC 2009


On Tue, Feb 24, 2009 at 9:56 AM, Brian <Brian.Mingus at colorado.edu> wrote:
> It's not at all clear why the English Wikipedia dump or other large
> dumps need to be compressed. It is far more absurd to spend hundreds
> of days compressing a file than it is to spend tens of days
> downloading one.

Faulty premise.  Based on my old-ish hardware and the smaller but
still very large ruwiki dump, I'd estimate that actually compressing
enwiki would take less than a week of processing time.  Since my
high-end DSL line would take multiple weeks to download ~2 TB
uncompressed, compressing first is clearly a net time savings.
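
A quick back-of-envelope in Python, if it helps (the throughput and
compression figures below are my own rough assumptions for
illustration, not measurements):

    # Compare downloading the raw dump vs. compressing it first and
    # downloading the compressed file.  All figures are illustrative
    # assumptions, not measurements.
    uncompressed_mb = 2 * 1000 * 1024   # ~2 TB dump, in MB
    dsl_mb_per_s = 10 / 8.0             # assumed 10 Mbps DSL downstream
    compress_mb_per_s = 5.0             # assumed bzip2-class throughput
    ratio = 0.05                        # assumed ~20:1 ratio for wiki XML

    download_raw_days = uncompressed_mb / dsl_mb_per_s / 86400.0
    compress_days = uncompressed_mb / compress_mb_per_s / 86400.0
    download_compressed_days = download_raw_days * ratio

    print("download uncompressed:   %.1f days" % download_raw_days)
    print("compress, then download: %.1f days"
          % (compress_days + download_compressed_days))

With those (assumed) numbers, compressing costs under a week and cuts
the download to about a day, versus nearly three weeks for the raw
file.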

Compression does take substantial time, but my impression is that the
hundreds of days come mostly from communicating with the data store
and assembling the XML, not from compressing the output.

-Robert Rohde



