On Fri, Nov 20, 2009 at 10:42 AM, Denny Vrandecic
<denny.vrandecic(a)kit.edu> wrote:
On Nov 20, 2009, at 16:38, Anthony wrote:
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic
<denny.vrandecic(a)kit.edu> wrote:
The newer dump should include almost all material
from the older dumps, so the older dumps are redundant.
Almost redundant :).
Correct -- there is a small amount of data that is *really* deleted, but my gut feeling
is that this is less than 0.1% of all revisions. This would need some evaluation, though.
Or do you mean something else?
No, that's what I mean, though I'm not sure if it's less than 0.1% (I
don't have any guess at all on the percentage). When an article is
"deleted" (set as deleted by an admin, which isn't even *really*
deleted), all revisions are removed from the public portion of the
database, which is where the dump comes from. Then, making up a much,
much smaller portion of the missing material, there are
oversighted revisions and individually deleted revisions.
I believe page moves (after a certain date?) are recorded in the logs.
They wouldn't be in the history dump itself, but they could
potentially be reconstructed by reading the logs.
The main thing that would be missing, and that can't be reconstructed
from the newer dumps, would be deleted articles. 0.1%, weighted by
number of revisions? I have absolutely no idea. I think the number
of deleted revisions is available to the public (through a toolserver
app) though, so we could probably calculate it.
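
If someone does pull those counts, the arithmetic would be roughly the
following. This is only a sketch: the function name and the two numbers
are placeholders I made up for illustration, not real figures from any
dump or tool.

```python
def deleted_share(deleted_revisions: int, live_revisions: int) -> float:
    """Return the fraction of all revisions (live plus deleted) that are deleted."""
    total = deleted_revisions + live_revisions
    if total == 0:
        raise ValueError("no revisions counted")
    return deleted_revisions / total


# Placeholder counts (assumptions, not measured values).
share = deleted_share(deleted_revisions=1_000, live_revisions=999_000)
print(f"deleted share: {share:.3%}")
print("below 0.1%" if share < 0.001 else "at or above 0.1%")
```

Note that the denominator has to include the deleted revisions too, since
the dump itself only contains the live ones.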