On 5/31/06, Brion Vibber brion@pobox.com wrote:
> Among other things I've been working on a distributed bzip2 compression tool which could help speed up generation of data dumps.
Alternatively, have you considered generating deltas? (Sorry if this has been brought up before...)
It seems to me there are two main consumption cases of the Wikipedia data:

- one-off copies ("most recent" doesn't really matter)
- mirrors (will want to continually update)

If you did a full snapshot once a month, and then daily/weekly deltas on top of that, you could maybe save yourself both processing time and external bandwidth.
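For the mirror case, the update cycle could look roughly like the sketch below. It's only a sketch: it assumes a binary-delta tool such as xdelta3 is installed, and the dump file names are made up for illustration.

    import subprocess

    def make_delta(old_dump, new_dump, delta_out):
        # Encode a delta that turns old_dump into new_dump (run on the dump server).
        subprocess.check_call(["xdelta3", "-e", "-s", old_dump, new_dump, delta_out])

    def apply_delta(old_dump, delta_in, new_dump):
        # Rebuild new_dump from the previous snapshot plus the downloaded delta (run on the mirror).
        subprocess.check_call(["xdelta3", "-d", "-s", old_dump, delta_in, new_dump])

    # Monthly full snapshot, then daily deltas on top of it; names are hypothetical.
    make_delta("pages-20060501.xml", "pages-20060502.xml", "pages-20060502.xdelta")
    apply_delta("pages-20060501.xml", "pages-20060502.xdelta", "pages-20060502.xml")

Presumably you'd compute the deltas over the uncompressed XML and compress the delta afterwards; a delta taken over the bzip2 output would be much larger, since recompression changes the whole stream.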