Hi,
On Thu, Feb 26, 2009 at 2:29 AM, Andrew Garrett <andrew@werdn.us> wrote:
> On Thu, Feb 26, 2009 at 5:08 AM, John Doe <phoenixoverride@gmail.com> wrote:
>> But what is lost in server space savings from compression would be compensated for by the stability and flexibility this method provides. It would allow whatever server is controlling the dump process to designate and delegate parallel processes for the same dump.
>
> Not nearly -- we're talking about a 100-fold decrease in compression ratio if we don't compress revisions of the same page adjacent to one another.
>
> -- Andrew Garrett
No, not nearly that bad. Keep in mind that ~10x of the compression is just from having English text and repeated XML tags, etc. (Note the compression ratio of the all-articles dump, which has only one revision of each article.)
If the revisions in each "block" are sorted by pageid, so that the revs of the same article are together, you'll get a very large part of the other 10x factor. Revisions to pages tend to cluster in time (think edits and reverts :-) as one or more people work on an article, or it is of news interest (see "Slumdog Millionaire" ;-) or whatever. You can see this for any given article, like this:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimi...
Look at the first three digits of the revid: when they are the same, those revisions would land in the same "block" (assuming 1M revs/block, as I suggested). You can check any title you like (remember _ for space, and %-escapes for many other characters, though a good browser will do that for you in most cases). Since the majority of edits go to a minority of titles (some version of the 80/20 rule applies), most revisions will be in the same block as a number of others for the same page.
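If you'd rather script that check than eyeball the URL, here is a rough sketch (Python; the api.php parameters are the standard ones, but the 1M-revs/block split and all the names are just my suggestion made concrete):

import json
import urllib.parse
import urllib.request

REVS_PER_BLOCK = 1_000_000  # the figure suggested above; adjust to taste

def rev_blocks(title, limit=50):
    # Fetch the last `limit` revision IDs for `title` from api.php.
    params = urllib.parse.urlencode({
        "action": "query", "prop": "revisions", "titles": title,
        "rvlimit": limit, "rvprop": "ids", "format": "json",
    })
    req = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "block-check-sketch/0.1"},  # api.php wants a UA
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    # Group the revision IDs by which fixed-size block they fall into.
    blocks = {}
    for rev in page.get("revisions", []):
        blocks.setdefault(rev["revid"] // REVS_PER_BLOCK, []).append(rev["revid"])
    return blocks

for block, revids in sorted(rev_blocks("Slumdog Millionaire").items()):
    print("block", block, ":", len(revids), "of the last 50 revs")

For most titles you should see exactly the clustering described above: a few blocks holding most of the recent revisions.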
So we will get most, but not all, of the other 10x compression ratio.
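A toy demonstration of why adjacency matters (a sketch, not real dump data: zlib is used here only because its small 32 KB window makes the effect visible on a few hundred KB, where bzip2's 900 KB blocks would hide it, and the fake "revisions" are made up):

import random
import zlib

random.seed(0)
WORDS = ["the", "quick", "brown", "wiki", "article", "edit", "revert"]

def base_text():
    # ~20 KB of text standing in for an article's wikitext
    return " ".join(random.choice(WORDS) for _ in range(3500))

def revision(base, i):
    # repetitive XML wrapper plus a near-identical body, as in a dump
    return "<revision><id>%d</id><text>%s tweak%d</text></revision>" % (i, base, i)

base_a, base_b = base_text(), base_text()
revs_a = [revision(base_a, i) for i in range(20)]
revs_b = [revision(base_b, 1000 + i) for i in range(20)]

page_sorted = "".join(revs_a + revs_b).encode()
interleaved = "".join(r for pair in zip(revs_a, revs_b) for r in pair).encode()

print("raw bytes:         ", len(page_sorted))
print("page-sorted, zlib: ", len(zlib.compress(page_sorted, 9)))
print("interleaved, zlib: ", len(zlib.compress(interleaved, 9)))

With revisions of the same page adjacent, each one compresses as a small diff against its neighbor; interleaved, the previous revision of the same page has already slid out of the window and each revision pays nearly full price.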
But even if the compressed blocks are (say) 20% bigger, the win is that once they are some weeks old, they NEVER need to be re-built. Each dump (which should then come out about weekly, with the same compute resources, as the queue runs faster ;-) need only build or re-build a few blocks. (And there is no need at all to parallelize any given dump; just run 3-5 different ones in parallel, as now.)
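The bookkeeping for that is trivial. A minimal sketch of the block-selection logic, assuming fixed 1M-revid blocks and leaving out the "some weeks old" grace period (function name and example revid numbers are made up):

REVS_PER_BLOCK = 1_000_000

def blocks_to_rebuild(max_revid_last_dump, max_revid_now):
    # Only blocks that received revisions since the last dump can differ;
    # everything older is reused verbatim from the previous dump.
    first = max_revid_last_dump // REVS_PER_BLOCK
    last = max_revid_now // REVS_PER_BLOCK
    return list(range(first, last + 1))

# e.g. last dump ended at revid 271,500,000 and we are now at 273,200,000:
print(blocks_to_rebuild(271_500_000, 273_200_000))  # -> [271, 272, 273]
# blocks 0..270 are carried over from the previous dump unchanged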
best, Robert