[Foundation-l] Delete of Article History and GFDL

Anthony wikimail at inbox.org
Tue Sep 16 22:22:39 UTC 2008


On Tue, Sep 16, 2008 at 10:15 AM, Charlotte Webb <charlottethewebb at gmail.com> wrote:

> Has anybody ever thought about doing split dumps instead?


Yes, this has been discussed to death by lots of people in various forums.
It's not really clear that the benefit would be significant enough to be
worth the (significant) effort.

Having spent the last 48 hours or so importing one of the smaller dump files
(enwiki-20080312-page.sql.gz) into MySQL, I'd say the bigger benefit would
come from producing a set of dump files which are already indexed (these
could be offered in addition to the dumps already made).  Preferably
something which could be read in place while still bzipped, which is
actually feasible, and which I'm about halfway through writing myself.  I
spend far more time uncompressing, importing, and indexing the dumps than I
do downloading them, and I just don't have the terabytes of free disk space
needed to keep a full dump around uncompressed.
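
Roughly what I have in mind is the sketch below: the text gets stored as
many small, independent bz2 streams concatenated into one file, with a side
index recording where each stream starts and how long it is.  The file
names and the index format are invented for illustration, and this is only
the reading side, not a finished tool:

    import bz2

    def read_chunk(data_path, index_path, chunk_no):
        # Hypothetical index format: one "offset length" pair per line,
        # one line per chunk, in chunk order.
        with open(index_path) as idx:
            entries = [tuple(map(int, line.split())) for line in idx]
        offset, length = entries[chunk_no]
        with open(data_path, 'rb') as data:
            data.seek(offset)
            compressed = data.read(length)
        # Each chunk is a complete bz2 stream, so it decompresses on its
        # own without touching the rest of the file.
        return bz2.decompress(compressed)

    # e.g. text = read_chunk('enwiki-text.bz2', 'enwiki-text.idx', 1234)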

Once I have everything imported into MySQL, I can just download the new
stub dumps and fetch the new revisions one at a time.  As a bonus, I won't
have to worry about the history dump failing.
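
Fetching individual revisions would be something along these lines against
api.php (a sketch only: the revid and User-Agent string are made up, and
the exact query parameters and response layout are my best understanding
of the API rather than anything authoritative):

    import json
    import urllib.request

    def fetch_revision_text(revid):
        # Ask api.php for the content of one revision, by revision id.
        url = ('https://en.wikipedia.org/w/api.php'
               '?action=query&prop=revisions&rvprop=content'
               '&format=json&revids=%d' % revid)
        req = urllib.request.Request(
            url, headers={'User-Agent': 'dump-sync-sketch/0.1'})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        # In this response format the wikitext sits under the '*' key
        # of the first (only) revision of the first (only) page.
        page = next(iter(data['query']['pages'].values()))
        return page['revisions'][0]['*']

    # e.g. text = fetch_revision_text(123456789)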

I guess I should just pony up a few hundred dollars for a terabyte hard
drive or two.  It should be easy to store the text in 900K bzip chunks
(which I can then index), but only if I have the drive space to expand
everything first and then recompress it.  Anyone want to lend me a couple of
terabyte hard drives for a month in exchange for a copy of anything I manage
to produce?
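
For the writing side of the 900K-chunk idea, the rough shape would be
something like this (again just a sketch, matching the invented index
format above; the 900,000-byte figure mirrors bzip2's maximum block size):

    import bz2

    CHUNK_SIZE = 900 * 1000  # roughly bzip2's maximum block size

    def write_chunks(text_path, data_path, index_path):
        # Split the uncompressed text into ~900K pieces, compress each
        # piece as its own bz2 stream, and record where each stream
        # starts and how long it is, so any chunk can later be pulled
        # back out on its own.
        offset = 0
        with open(text_path, 'rb') as src, \
             open(data_path, 'wb') as data, \
             open(index_path, 'w') as idx:
            while True:
                chunk = src.read(CHUNK_SIZE)
                if not chunk:
                    break
                compressed = bz2.compress(chunk)
                idx.write('%d %d\n' % (offset, len(compressed)))
                data.write(compressed)
                offset += len(compressed)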


