Evan Martin wrote:
> On 5/31/06, Brion Vibber <brion@pobox.com> wrote:
>> Among other things I've been working on a distributed bzip2 compression tool which could help speed up generation of data dumps.
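Quick aside on why distributed bzip2 compression is feasible at all: bzip2 streams that were compressed independently can simply be concatenated, and plain bunzip2 will decompress the whole thing. Here is a minimal block-parallel sketch in Python; it only illustrates the idea, it is not the actual tool, and the 5 MB chunk size is an arbitrary pick:

# Minimal sketch of block-parallel bzip2 compression: compress fixed-size
# chunks in worker processes and concatenate the resulting streams.
# The output is a valid multi-stream .bz2 that ordinary bunzip2 accepts.
import bz2
import sys
from multiprocessing import Pool

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB per work unit (arbitrary for this sketch)

def read_chunks(path):
    with open(path, 'rb') as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield data

def main(src, dst):
    with Pool() as pool, open(dst, 'wb') as out:
        # imap preserves chunk order, so the concatenated output stays correct
        for stream in pool.imap(bz2.compress, read_chunks(src)):
            out.write(stream)

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])

Usage would be something like: python parallel_bz2.py pages-meta-history.xml pages-meta-history.xml.bz2 -- the point being that the CPU-heavy compression spreads across cores, or with more plumbing, across machines.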
> Alternatively, have you considered generating deltas? (Sorry if this has been brought up before...)
Many times, but it's not necessarily clear or simple. The generic delta-generation tools we've tried in the past just choke on our files; note that the full-history dump of English Wikipedia -- the one we're most concerned about keeping archival copies of -- is over 350 gigabytes uncompressed.
(Clean XML-wrapped text with no scary internal compression or diffing, and a well-known standard compression format on the outside, is a simple and relatively future-proof format for third-party textual analysis, reuse, and long-term archiving.)
Something application-specific might be possible.
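As a rough illustration of what "application-specific" could look like: every revision in the dump carries a stable numeric id, so a delta between two snapshots can just be "revisions added" plus "revisions removed". A minimal Python sketch of that bookkeeping follows; the per-dump {revision_id: page_id} indexes are toy stand-ins for a streaming pass over the XML, not existing MediaWiki dump tooling:

# Sketch only: a dump-to-dump delta expressed as added/removed revision ids.
# The toy indexes below stand in for a hypothetical revision index built by
# streaming each dump once.
def compute_delta(old_index, new_index):
    added = {rid: pid for rid, pid in new_index.items() if rid not in old_index}
    removed = [rid for rid in old_index if rid not in new_index]
    return added, removed

old_index = {1001: 42, 1002: 42, 1003: 7}                      # previous snapshot
new_index = {1001: 42, 1002: 42, 1003: 7, 1004: 42, 1005: 9}   # current snapshot
added, removed = compute_delta(old_index, new_index)
# added   -> {1004: 42, 1005: 9}  (ship these with their full revision text)
# removed -> []                   (revisions deleted or oversighted since last time)

The delta file would then carry the full XML for the added revisions and just the ids for the removed ones, which sidesteps the generic binary-diff tools entirely.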
> It seems to me there are two main consumption cases for the Wikipedia data:
> - one-off copies ("most recent" doesn't really matter)
> - mirrors (will want to continually update)
> If you did a full snapshot once a month, and then daily/weekly deltas on top of that, you could maybe save yourself both processing time and external bandwidth.
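Under that scheme the mirror side is just a replay: fetch the monthly full snapshot once, then apply each dated delta in order. Continuing the hypothetical added/removed delta format from the sketch above:

# Sketch of the mirror-side catch-up step.  `deltas` is a list of
# (added, removed) pairs, ordered oldest to newest, in the hypothetical
# format from the earlier sketch.
def apply_deltas(snapshot_index, deltas):
    index = dict(snapshot_index)
    for added, removed in deltas:
        index.update(added)
        for rid in removed:
            index.pop(rid, None)
    return index

A mirror would then pay for the full snapshot once a month and only the comparatively small deltas in between.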
Even if I only did full snapshots a quarter as often, I'd still want them to take two days instead of ten. :)
-- brion vibber (brion @ pobox.com)