Hi,
On Thu, Feb 26, 2009 at 2:29 AM, Andrew Garrett <andrew(a)werdn.us> wrote:
On Thu, Feb 26, 2009 at 5:08 AM, John Doe
<phoenixoverride(a)gmail.com> wrote:
But the server space saved by compression would be compensated for by the
stability and flexibility provided by this method. This would allow
whatever server is controlling the dump process to designate and delegate
parallel processes for the same dump.
Not nearly -- we're talking about a 100-fold decrease in compression
ratio if we don't compress revisions of the same page adjacent to one
another.
--
Andrew Garrett
No, not nearly that bad. Keep in mind that ~10x of the compression is
just from having English text and repeated XML tags, etc. (Note the
compression ratio of the all-articles dump, which has only one
revision of each article.)
If the revisions in each "block" are sorted by pageid, so that the
revs of the same article are together, you'll get a very large part of
the other 10x factor. Revisions to pages tend to cluster in time
(think edits and reverts :-) as one or more people work on an article,
or it is of news interest (see "Slumdog Millionaire" ;-) or whatever.
You can see this for any given article, like this:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlim…
Look at the first three digits of the revid: when they are the same, the
revisions would be in the same "block" (assuming 1M revs/block, as I
suggested). You can check any title you like (remember _ for space, and
%-escapes for many characters, though a good browser will handle that for
you in most cases). Since the majority of edits are to a minority of
titles (some version of the 80/20 rule applies), most revisions will be
in the same block as a number of others for the same page.
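The block arithmetic is trivial; a sketch in Python (the revids below are
made up for illustration):

```python
REVS_PER_BLOCK = 1_000_000  # block size suggested above

def block_of(revid):
    # For a 9-digit revid, revid // 1M is exactly its first three digits.
    return revid // REVS_PER_BLOCK

# Three clustered edits to one article land in the same block;
# a much later edit lands in another.
revids = [273454012, 273454980, 273512345, 291000123]
print([block_of(r) for r in revids])  # → [273, 273, 273, 291]
```

Within each block you would then sort the revisions by pageid (and revid),
so the revs of the same article sit adjacent for the compressor.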
So we will get most, but not all, of the other 10x compression ratio.
But even if the compressed blocks are (say) 20% bigger, the win is
that once they are some weeks old, they NEVER need to be re-built.
Each dump (which could then run about weekly with the same compute
resources, since the queue moves faster ;-) needs only to build or
re-build a few blocks. (And there is no need at all to parallelize any
given dump; just run 3-5 different ones in parallel, as now.)
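To make the incremental idea concrete, here is one way it could look (a
sketch under the 1M-revs/block assumption; the function name and numbers
are made up, this is not the actual dump code):

```python
def blocks_to_rebuild(prev_max_revid, new_max_revid, per_block=1_000_000):
    """Only blocks holding revisions newer than the previous dump need
    (re)building; every earlier, settled block is reused unchanged."""
    return list(range(prev_max_revid // per_block,
                      new_max_revid // per_block + 1))

# A week's worth of edits touches only a handful of trailing blocks:
print(blocks_to_rebuild(273_500_000, 276_200_000))  # → [273, 274, 275, 276]
```

Everything before block 273 in this example is served straight from the
previous dump's output, which is where the compute saving comes from.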
best, Robert