[WikiEN-l] Old Wikipedia backups discovered

Joseph Reagle joseph.2008 at reagle.org
Wed Dec 15 21:04:26 UTC 2010


On Tuesday, December 14, 2010, Tim Starling wrote:
> I didn't want to believe that those revisions had been lost forever,
> and I even opened the UseMod source code and stared forlornly at the
> unlink() call. What I (and Brion before) missed is that UseMod appends
> a record of every change made to two files, called diff_log and rclog.
> In these two files is a record of every change made to Wikipedia from
> January 15 to August 17, 2001.

Unfortunately, it doesn't look like versions of the articles beyond the first ~10 are automatically recoverable. I wrote a Python script to reconstruct the early Wikipedia, but it fails because of apparent weaknesses in "normal diffs", the format UseMod appears to use. To reconstruct an article as it stood at a particular point in time, I iteratively apply every diff up to that point via `patch`. It doesn't take long before `patch` chokes on a diff. In fact, I've found simple cases in which a normal diff and `patch` are incapable of round-tripping.
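
For anyone who wants to try the same approach, here is a minimal sketch of the iterative replay described above. It is not my actual script: it swaps the external `patch` program for a small pure-Python applier of normal-format hunks so that it is self-contained, and it assumes the per-revision diffs have already been pulled out of diff_log as separate strings (how to split diff_log is not shown). The names `apply_normal_diff`, `reconstruct`, and `PatchError` are made up for illustration.

```python
import re

# Normal-diff hunk header, e.g. "3a4", "2,3d1", "5c5,6".
HUNK_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


class PatchError(Exception):
    """Raised when a diff cannot be applied to the current text."""


def apply_normal_diff(old_lines, diff_text):
    """Apply one normal-format diff to a list of lines; return the new list."""
    # Drop "\ No newline at end of file" markers; content lines never start
    # with a backslash because they are prefixed with "< " or "> ".
    lines = [l for l in diff_text.splitlines() if not l.startswith("\\ ")]
    result = []
    pos = 0  # index of the next line of old_lines not yet copied
    i = 0
    while i < len(lines):
        m = HUNK_RE.match(lines[i])
        if not m:
            raise PatchError(f"unexpected line in diff: {lines[i]!r}")
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        i += 1
        removed, added = [], []
        while i < len(lines) and lines[i].startswith("< "):
            removed.append(lines[i][2:])
            i += 1
        if i < len(lines) and lines[i] == "---":
            i += 1
        while i < len(lines) and lines[i].startswith("> "):
            added.append(lines[i][2:])
            i += 1
        # "NaM" adds after old line N; "c" and "d" start at old line N.
        keep_upto = start if op == "a" else start - 1
        if keep_upto < pos or keep_upto > len(old_lines):
            raise PatchError(f"hunk at old line {start} is out of range")
        result.extend(old_lines[pos:keep_upto])
        pos = keep_upto
        if op in "cd":
            # A normal diff has no context, so the removed lines are the
            # only check that the diff belongs to this text.
            if old_lines[pos:end] != removed:
                raise PatchError(f"old lines {start}-{end} do not match")
            pos = end
        result.extend(added)
    result.extend(old_lines[pos:])
    return result


def reconstruct(first_version, diffs):
    """Replay diffs in order on top of first_version; return the final text."""
    lines = first_version.splitlines()
    for n, diff_text in enumerate(diffs, 1):
        try:
            lines = apply_normal_diff(lines, diff_text)
        except PatchError as exc:
            raise PatchError(f"diff {n} failed: {exc}") from exc
    return "\n".join(lines)
```

Called as `reconstruct(first_text, [diff1, diff2, ...])`, this stops at the first diff whose removed lines no longer match the text built so far, which is roughly the point at which `patch` gives up in my runs.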

I hope someone will eventually prove me wrong, or that some log will be found that is actually capable of recreating the state. (I wonder what the point of providing a diff_log export is if it isn't usable; perhaps the UseMod folks could speak to that.)


