Among other things I've been working on a distributed bzip2 compression tool which could help speed up generation of data dumps.
By trading LAN bandwidth for idle CPU elsewhere in the server cluster, an order-of-magnitude improvement in throughput seems reasonably practical; this could cut bzip2 compression time for the large English Wikipedia history dumps by a full day.
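For a rough idea of the approach, here is an illustrative Python sketch only, not dbzip2's actual code or network protocol: bzip2 streams compressed independently can be concatenated into one valid .bz2 file, so the input can be split into blocks, compressed in parallel, and the pieces joined back in order. (dbzip2 farms the blocks out to remote workers over the LAN; this sketch just uses local processes.)

    # Illustrative only: a local process-pool stand-in for the idea of farming
    # bzip2 blocks out to other machines. dbzip2 itself ships blocks over the
    # LAN; none of that protocol is shown here.
    import bz2
    from multiprocessing import Pool

    BLOCK_SIZE = 900 * 1024  # split unit; bzip2's own maximum block size at -9

    def compress_block(block):
        # Each chunk becomes a complete, standalone bzip2 stream.
        return bz2.compress(block, 9)

    def parallel_bzip2(data, workers=4):
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        with Pool(workers) as pool:
            parts = pool.map(compress_block, blocks)
        # Concatenated bzip2 streams decompress as a single file with stock bunzip2.
        return b"".join(parts)

    if __name__ == "__main__":
        payload = b"sample dump text " * 100000
        assert bz2.decompress(parallel_bzip2(payload)) == payload

The interesting parts of the real tool -- streaming rather than buffering, and the remote workers -- are exactly what this sketch leaves out.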
Status/documentation: http://www.mediawiki.org/wiki/dbzip2
Source: http://svn.wikimedia.org/viewvc/mediawiki/trunk/dbzip2
Updates on my (*blush*) development blog: http://leuksman.com/
I'm hoping something similar can be accomplished with 7zip as well...
-- brion vibber (brion @ pobox.com)
On 5/31/06, Brion Vibber brion@pobox.com wrote:
Among other things I've been working on a distributed bzip2 compression tool which could help speed up generation of data dumps.
Alternatively, have you considered generating deltas? (Sorry if this has been brought up before...)
It seems to me there are two main consumption cases of the wikipedia data:
- one-off copies ("most recent" doesn't really matter)
- mirrors (will want to continually update)
If you did a full snapshot once a month, and then daily/weekly deltas on top of that, you could maybe save yourself both processing time and external bandwidth.
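(For illustration only: one way to do this would be a generic binary delta tool such as xdelta3, wrapped in Python. The file names here are made up, and as noted further down the thread, generic delta tools have not coped well with dumps of this size, so treat this purely as the concept.)

    # Sketch of "monthly full + periodic deltas", assuming xdelta3 is installed.
    import subprocess

    def make_delta(previous_dump, current_dump, delta_out):
        # Encode: delta_out holds only the changes from previous_dump to current_dump.
        subprocess.run(["xdelta3", "-e", "-s", previous_dump, current_dump, delta_out],
                       check=True)

    def apply_delta(previous_dump, delta_in, reconstructed):
        # Decode: a mirror rebuilds the current dump from last month's full + the delta.
        subprocess.run(["xdelta3", "-d", "-s", previous_dump, delta_in, reconstructed],
                       check=True)

    # e.g. make_delta("enwiki-2006-05-full.xml", "enwiki-2006-06-full.xml",
    #                 "enwiki-2006-06.xdelta")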
Evan Martin wrote:
On 5/31/06, Brion Vibber brion@pobox.com wrote:
Among other things I've been working on a distributed bzip2 compression tool which could help speed up generation of data dumps.
Alternatively, have you considered generating deltas? (Sorry if this has been brought up before...)
Many times, but it's not as simple as it sounds. The generic delta-generation tools we've tried in the past just choke on our files; note that the full-history dump of English Wikipedia -- the one we're most concerned about having archival copies of available -- is over 350 gigabytes uncompressed.
(Clean XML-wrapped text with no scary internal compression or diffing, and a well-known standard compression format on the outside, is a simple and relatively future-proof format for third-party textual analysis, reuse, and long-term archiving.)
Something application-specific might be possible.
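As a purely hypothetical example of what "application-specific" could mean: since the full-history dump mostly grows by gaining new revisions, a delta could simply be the revisions added since the last snapshot, extracted by streaming the export XML. (Illustrative Python only; the element names follow the MediaWiki export schema, and the cutoff handling is an assumption, not anything that exists in the dump tools.)

    # Hypothetical application-specific delta: stream the export XML and keep
    # only revisions newer than the previous snapshot's cutoff timestamp.
    import sys
    import xml.etree.ElementTree as ET

    CUTOFF = "2006-05-01T00:00:00Z"  # assumed timestamp of the last full snapshot

    def emit_new_revisions(dump_path, cutoff=CUTOFF, out=sys.stdout):
        for _, elem in ET.iterparse(dump_path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
            if tag == "revision":
                ts = next((c.text for c in elem if c.tag.endswith("timestamp")), None)
                # MediaWiki timestamps are ISO 8601, so string comparison suffices.
                if ts and ts > cutoff:
                    out.write(ET.tostring(elem, encoding="unicode"))
            elif tag == "page":
                elem.clear()  # free finished pages while streaming a huge dump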
It seems to me there are two main consumption cases of the wikipedia data:
- one-off copies ("most recent" doesn't really matter)
- mirrors (will want to continually update)
If you did a full snapshot once a month, and then daily/weekly deltas on top of that, you could maybe save yourself both processing time and external bandwidth.
Even if I only did full snapshots a quarter as often, I'd still want them to take two days instead of ten. :)
-- brion vibber (brion @ pobox.com)
On Wed, May 31, 2006 at 10:08:12PM -0700, Brion Vibber wrote:
Even if I only did full snapshots a quarter as often, I'd still want them to take two days instead of ten. :)
Yeah, I was a little queasy when I heard you were going to *shorten* them by a day; that's like hearing they're going to give you $5000 off the price of the car -- what does the car *cost*??
Cheers, -- jra