But the server space lost to weaker compression would be compensated for by the stability and flexibility this method provides. It would let whatever server is controlling the dump process designate and delegate parallel processes for the same dump, so block 1 could be built on server 1 and block 2 on server 3. That gives the flexibility to use as many servers as are available for this task more efficiently. If block 200 of en.wp breaks for some reason, you don't have to rebuild the previous 199 blocks; you can just delegate a server to rebuild that single block. That would make the dump process a little more crash-friendly (even though I know we don't want to admit crashes happen :) ), and it would also let the dump time for future dumps be cut drastically. I'd recommend a block size of either 10M revision IDs or 10% of the database, whichever is larger, for new dumps, to screen out a majority of the deletions. What are your thoughts on this process, Brion (and the rest of the tech team)?
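A rough sketch of how that delegation could look, assuming a hypothetical run_block_dump() helper that dumps one revision-ID block on one server; the names, server list, and the 10M block size are illustrative, not an existing Wikimedia tool:

from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 10_000_000  # revision IDs per block, per the suggestion above

def run_block_dump(server, block_no):
    """Dump revision IDs [block_no*BLOCK_SIZE, (block_no+1)*BLOCK_SIZE) on `server`.

    Placeholder: a real version would queue a job on that server and report
    whether the block completed.
    """
    return True

def dump_all_blocks(servers, max_rev_id):
    """Hand out blocks round-robin to the available servers, in parallel."""
    n_blocks = max_rev_id // BLOCK_SIZE + 1
    failed = []
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {
            pool.submit(run_block_dump, servers[b % len(servers)], b): b
            for b in range(n_blocks)
        }
        for future, block_no in futures.items():
            if not future.result():
                failed.append(block_no)
    # Only the failed blocks need to be re-run; all other blocks are kept.
    return failed

If block 200 fails, dump_all_blocks() reports just [200], and only that block needs to be handed to a server again.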
Betacommand
On Wed, Feb 25, 2009 at 9:00 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/2/25 Robert Ullmann rlullmann@gmail.com:
I suggest the history be partitioned into "blocks" by *revision ID*.
Like this: revision IDs 0-999,999 go in "block 0", 1M to 2M-1 in "block 1", and so on. The English Wiktionary at the moment would have 7 blocks; the English Wikipedia would have 273.
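A minimal sketch of that mapping (in Python, purely illustrative): a revision's block is just its ID divided by the 1M block size.

BLOCK_SIZE = 1_000_000  # revision IDs per block

def block_of(rev_id):
    # block n holds revision IDs n*BLOCK_SIZE .. (n+1)*BLOCK_SIZE - 1
    return rev_id // BLOCK_SIZE

print(block_of(999_999))    # 0  (last ID in block 0)
print(block_of(1_000_000))  # 1  (first ID in block 1)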
One problem with that is that you won't get such good compression ratios. Most of the revisions of a single article are very similar to the revisions before and after it, so they compress down very small. If you break up the articles between different blocks you don't get that advantage (at least, not to the same extent).
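A rough, self-contained illustration of that effect (not the actual dump code; the texts and sizes are made up): near-identical revisions compressed together in one block shrink to little more than one copy, while the same revisions scattered over separately compressed blocks cost close to a full copy per block.

import bz2
import random
import string

random.seed(0)
# ~20 KB of incompressible "article text", plus 30 revisions that each
# differ from it by only a small edit.
base = "".join(random.choice(string.ascii_lowercase + " ") for _ in range(20_000))
revisions = [base + " minor edit number %d" % i for i in range(30)]

# All revisions of the article kept in one block and compressed together.
together = len(bz2.compress("".join(revisions).encode()))

# The same revisions scattered across 10 blocks, each compressed on its own.
scattered = sum(
    len(bz2.compress("".join(revisions[i::10]).encode()))
    for i in range(10)
)

print("one block:  ", together)   # close to the size of a single compressed revision
print("ten blocks: ", scattered)  # roughly ten compressed copies' worth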