On Fri, Nov 20, 2009 at 10:42 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
> On Nov 20, 2009, at 16:38, Anthony wrote:
>> On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
>>> The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
>> Almost redundant :).
> Correct -- there is a small amount of data that is *really* deleted, but my gut feeling is that this is less than 0.1% of all revisions. This would need some evaluation, though.
> Or do you mean something else?
No, that's what I mean, though I'm not sure if it's less than 0.1% (I don't have any guess at all on the percentage). When an article is "deleted" (set as deleted by an admin, which isn't even *really* deleted), all of its revisions are removed from the public portion of the database, which is where the dump comes from. Then, making up a much, much smaller portion of the missing material, there are oversighted revisions and individually deleted revisions.
I believe page moves (after a certain date?) are recorded in the logs. They wouldn't be in the history dump itself, but they could potentially be backed into by reading the logs.
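A rough sketch of what "backing into" an old title from the logs could look like. The `MoveEntry` record and the `title_at` helper are hypothetical and much simpler than the real log format; the sketch also assumes the list holds only the moves of the one page in question, so a title later reused by a different page is not handled.

```python
from typing import Iterable, NamedTuple


class MoveEntry(NamedTuple):
    # Hypothetical, simplified move-log record (not the real log schema).
    timestamp: str  # ISO 8601 in UTC, so plain string comparison orders by time
    old_title: str
    new_title: str


def title_at(current_title: str, moves: Iterable[MoveEntry], when: str) -> str:
    """Return the title the page had at time `when`, undoing later moves."""
    title = current_title
    # Walk the log from the newest move to the oldest.
    for entry in sorted(moves, key=lambda e: e.timestamp, reverse=True):
        if entry.timestamp <= when:
            break  # this move and all older ones had already happened by `when`
        if entry.new_title == title:
            title = entry.old_title  # undo a move that happened after `when`
    return title


# Made-up example: a page moved A -> B in 2006 and B -> C in 2008.
moves = [
    MoveEntry("2006-01-01T00:00:00Z", "A", "B"),
    MoveEntry("2008-05-01T00:00:00Z", "B", "C"),
]
title_in_2007 = title_at("C", moves, "2007-01-01T00:00:00Z")
```

This only recovers titles for the period the logs actually cover; moves made before they were logged would still be invisible.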
The main thing that would be missing, and that can't be reconstructed from the newer dumps, would be deleted articles. 0.1%, weighted by number of revisions? I have absolutely no idea. I think the number of deleted revisions is available to the public (through a toolserver app) though, so we could probably calculate it.
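For what it's worth, the calculation itself is simple once the two counts are known. Below is a minimal sketch, assuming we can get a count of deleted revisions and a count of revisions still in the public dump; the function name and the numbers are placeholders, not real figures.

```python
def deleted_share(deleted_revisions: int, public_revisions: int) -> float:
    """Fraction of all revisions (deleted + still public) that are deleted."""
    total = deleted_revisions + public_revisions
    if total == 0:
        raise ValueError("no revisions to compare")
    return deleted_revisions / total


# Placeholder counts for illustration only, not measured values.
share = deleted_share(deleted_revisions=1, public_revisions=999)
below_guess = share < 0.001  # is the share under the 0.1% gut feeling?
```

Note that this weights by number of revisions, as discussed above; weighting by article count or by bytes would need different inputs.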