On Tue, Feb 24, 2009 at 9:56 AM, Brian <Brian.Mingus(a)colorado.edu> wrote:
> It's not at all clear why the English Wikipedia dump or other large
> dumps need to be compressed. It is far more absurd to spend hundreds
> of days compressing a file than it is to spend tens of days
> downloading one.
Faulty premise. Based on my old-ish hardware and the smaller but
still very large ruwiki dump, I'd assume the actual compression of
enwiki would take less than a week of processing time. Since my
high-end DSL would take multiple weeks to download ~2 TB uncompressed,
it is clearly a net time savings to compress it first.
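The arithmetic behind that claim can be sketched roughly as follows. The link speed, compression ratio, and compression time are illustrative assumptions on my part (only the ~2 TB figure comes from the thread), but the conclusion holds across a wide range of plausible values:

```python
# Back-of-envelope check of the compress-then-download trade-off.
# All figures below are assumptions for illustration, not measurements.

uncompressed_tb = 2.0        # ~2 TB uncompressed enwiki dump (from the thread)
dsl_mbit_per_s = 10.0        # assumed "high end DSL" downlink
compression_ratio = 15.0     # assumed ratio for bzip2-style compression of wiki XML
compress_days = 7.0          # assumed compression time ("less than a week")

bits_total = uncompressed_tb * 1e12 * 8
seconds_uncompressed = bits_total / (dsl_mbit_per_s * 1e6)
days_uncompressed = seconds_uncompressed / 86400

days_compressed = days_uncompressed / compression_ratio + compress_days

print(f"download uncompressed: {days_uncompressed:.1f} days")
print(f"compress + download:   {days_compressed:.1f} days")
```

Even if the compression ratio or link speed were off by a factor of two in either direction, compressing first still comes out well ahead for a transfer this large.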
Compression does take substantial time, but my impression is that the
hundreds of days come mostly from communicating with the data store
and assembling the XML, not from compressing the output.
-Robert Rohde