On Tue, Sep 16, 2008 at 10:15 AM, Charlotte Webb <charlottethewebb@gmail.com> wrote:
Has anybody ever thought about doing split dumps instead?
Yes, this has been discussed to death by lots of people in various forums. It's not really clear that the benefit would be significant enough to be worth the (significant) effort.
Having spent the last 48 hours or so importing one of the smaller dump files (enwiki-20080312-page.sql.gz) into MySQL, I'd say the bigger benefit would come from producing a set of dump files that are already indexed (this could be in addition to the dumps already made), preferably in a form that can be accessed in place while still bzipped. That is actually feasible, and it's something I'm about halfway through writing myself. I spend far more time uncompressing, importing, and indexing the dumps than I do downloading them, and I just don't have the terabytes of free disk space needed to keep a full dump around uncompressed.
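Roughly what I mean by in-place access, as a minimal sketch: assume the text has been re-cut into independent ~900K bzip2 streams, and that a separately built index maps each page to the byte offset of the stream holding it. The function name, the index lookup, and the 2 MB read size below are just placeholders:

import bz2

def read_stream(dump_path, offset):
    # Decompress only the single ~900K bzip2 stream starting at `offset`.
    # The offset comes from the index (page id -> stream offset); the
    # decompressor stops at the end of that stream and ignores whatever
    # trailing bytes were read past it.
    with open(dump_path, 'rb') as f:
        f.seek(offset)
        raw = f.read(2 * 1024 * 1024)  # comfortably more than one compressed stream
    return bz2.BZ2Decompressor().decompress(raw).decode('utf-8')

The point is that you only ever pay for decompressing one ~900K chunk per lookup instead of the whole file.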
Once I have everything imported into MySQL, I can just download the new stub dumps and then fetch the new revisions one at a time. As a bonus, I won't have to worry about the history dump failing.
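Fetching an individual revision can go through api.php, along these lines (just a sketch, one revision per request; in practice you would batch the revids and be gentle on the servers, and the stub dump is still what tells you which revision IDs are new):

import json
import urllib.parse
import urllib.request

API = 'https://en.wikipedia.org/w/api.php'

def fetch_revision_text(rev_id):
    # Pull the wikitext of one revision listed in the stub dump.
    params = urllib.parse.urlencode({
        'action': 'query',
        'prop': 'revisions',
        'revids': rev_id,
        'rvprop': 'content',
        'format': 'json',
    })
    with urllib.request.urlopen(API + '?' + params) as resp:
        data = json.load(resp)
    page = next(iter(data['query']['pages'].values()))
    return page['revisions'][0]['*']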
I guess I should just pony up a few hundred dollars for a terabyte hard drive or two. It should be easy to store the text in 900K bzip chunks (which I can then index; rough sketch below), but only if I have the drive space to expand everything first and then recompress it. Anyone want to lend me a couple of terabyte hard drives for a month in exchange for a copy of anything I manage to produce?
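For what it's worth, the chunk-and-index step I have in mind is roughly this (just a sketch; it assumes the pages come out of the expanded dump as (page_id, text) pairs with integer ids, and the index only records which stream a page lives in, so a reader still scans within that one chunk):

import bz2

def write_chunked_dump(pages, dump_path, index_path, chunk_size=900 * 1024):
    # Write pages as a series of independent ~900K bzip2 streams, plus a
    # tab-separated index of page id -> byte offset of its stream.
    with open(dump_path, 'wb') as out, open(index_path, 'w') as idx:
        offset = 0
        buf, ids, size = [], [], 0

        def flush():
            nonlocal offset
            stream = bz2.compress(''.join(buf).encode('utf-8'))
            for pid in ids:
                idx.write('%d\t%d\n' % (pid, offset))
            out.write(stream)
            offset += len(stream)

        for page_id, text in pages:
            buf.append(text)
            ids.append(page_id)
            size += len(text)
            if size >= chunk_size:
                flush()
                buf, ids, size = [], [], 0
        if buf:
            flush()

Concatenated streams like this are still a valid bzip2 file, so the result stays readable by ordinary bunzip2 even without the index.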