On Tuesday, December 14, 2010, Tim Starling wrote:
I didn't want to believe that those revisions had been lost forever, and I even opened the UseMod source code and stared forlornly at the unlink() call. What I (and Brion before) missed is that UseMod appends a record of every change made to two files, called diff_log and rclog. In these two files is a record of every change made to Wikipedia from January 15 to August 17, 2001.
Unfortunately, it doesn't look like versions of the articles beyond the first ~10 are automatically recoverable. I wrote a Python script to reconstruct the early WP, but it fails because of apparent weaknesses in "normal diffs", which is what UseMod apparently uses. To reconstruct any particular version in time, I iteratively apply all diffs via `patch` up to that point. It doesn't take long before patch chokes on a diff. In fact, I've discovered there are simple cases in which normal_diff/patch are incapable of round tripping.
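To make the replay step concrete, here is a minimal Python sketch of applying one "normal" format diff to a list of lines without shelling out to `patch`. It is an illustration only, not the script described above. It assumes each diff_log entry can be isolated as a normal diff running from the older text to the newer text, that line terminators have already been stripped, and that no "\ No newline at end of file" markers are present. The name `apply_normal_diff` is hypothetical.

```python
import re

# Matches normal-diff change commands such as "3a4,5", "2,3d1" or "5c5".
_CMD = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def apply_normal_diff(old_lines, diff_lines):
    """Apply a normal-format diff to old_lines and return the new lines.

    Both arguments are lists of strings without line terminators.
    Raises ValueError if the diff does not fit the text it is applied to.
    """
    result = []
    pos = 0  # index of the next line of old_lines not yet consumed
    i = 0
    while i < len(diff_lines):
        m = _CMD.match(diff_lines[i])
        if not m:
            raise ValueError(f"unexpected diff line {i + 1}: {diff_lines[i]!r}")
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        i += 1

        removed, added = [], []
        while i < len(diff_lines) and diff_lines[i].startswith("< "):
            removed.append(diff_lines[i][2:])
            i += 1
        if i < len(diff_lines) and diff_lines[i] == "---":
            i += 1
        while i < len(diff_lines) and diff_lines[i].startswith("> "):
            added.append(diff_lines[i][2:])
            i += 1

        # "a" means "append after line start"; "c" and "d" replace or
        # delete the range start..end (1-based, inclusive).
        keep_until = start if op == "a" else start - 1
        result.extend(old_lines[pos:keep_until])
        pos = keep_until
        if op != "a":
            if old_lines[pos:end] != removed:
                raise ValueError(f"diff does not match text at line {start}")
            pos = end
        result.extend(added)

    result.extend(old_lines[pos:])
    return result
```

Replaying a page would then mean calling this once per diff in timestamp order and stopping at the first `ValueError`, which marks the point where the stored diffs and the reconstructed text diverge.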
I hope someone will eventually prove me wrong, or some log is found that is actually capable of recreating the state. (I wonder what the point of providing a diff_log export is if it isn't useable, and perhaps UseMod folks could speak to that.)
On 16/12/10 08:04, Joseph Reagle wrote:
Unfortunately, it doesn't look like versions of the articles beyond the first ~10 are automatically recoverable.
There were some changes made to the page text that weren't represented in diff_log, specifically changing certain camel-case links to free links. If you can work out what the changes were and when they were made, you can recover the text. I successfully recovered all 119 revisions of [[Larry Sanger]], using the following transformation applied after 984005227 UNIX time:
    'LarrySanger' => 'Larry Sanger',
    'JimboWales' => 'Jimbo Wales',
    'WikiPedia' => 'Wikipedia',
    'UnitedStates' => 'United States',
I'm not sure how many links were changed in this way, but it seems to have been a hand-constructed list.
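As an illustration of that transformation, here is a minimal Python sketch. It uses only the four pairs and the timestamp quoted above. The function name `convert_links`, the whole-word matching, and the reading of "after" as strictly greater than the cutoff are assumptions made for the example, and the real list was probably longer.

```python
import re

# UNIX time quoted in the message above; later revisions get the mapping.
FREE_LINK_CUTOFF = 984005227

# Only the four pairs quoted above; the full hand-constructed list is unknown.
FREE_LINK_MAP = {
    "LarrySanger": "Larry Sanger",
    "JimboWales": "Jimbo Wales",
    "WikiPedia": "Wikipedia",
    "UnitedStates": "United States",
}

# Assumption: only replace whole words, so "WikiPediaFAQ" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in FREE_LINK_MAP) + r")\b"
)


def convert_links(text, timestamp):
    """Rewrite known camel-case links in text if timestamp is past the cutoff."""
    if timestamp <= FREE_LINK_CUTOFF:
        return text
    return _PATTERN.sub(lambda m: FREE_LINK_MAP[m.group(1)], text)
```

In a replay loop this would be applied to the reconstructed text before the next diff is applied, so that the diff's context lines match the converted text.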
-- Tim Starling
On Wednesday, December 15, 2010, Tim Starling wrote:
There were some changes made to the page text that weren't represented in diff_log, specifically changing certain camel-case links to free links.
It appears my problems were related to some CR/LF issues not round-tripping between diff and patch, but I hope to be able to address that. And yes, in addition to some of the CamelCase issues, I expect another problem is that if a page is blanked, "Describe the new page here." will reappear outside of the diff_log.
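One common way to sidestep this kind of CR/LF mismatch is to normalize line endings in both the page text and the diff before comparing or applying anything. The Python sketch below shows that idea only. Whether it is the fix eventually used here is not stated in the thread, and the helper names are hypothetical.

```python
def normalize_newlines(text):
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_lines(text):
    """Split text into lines without terminators, ignoring one final newline."""
    lines = normalize_newlines(text).split("\n")
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Running both the stored text and each diff through `to_lines` means a stray carriage return can no longer make two otherwise identical lines compare as different.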
On 16/12/10 23:10, Joseph Reagle wrote:
On Wednesday, December 15, 2010, Tim Starling wrote:
There were some changes made to the page text that weren't represented in diff_log, specifically changing certain camel-case links to free links.
It appears my problems were related to some CR/LF issues not round-tripping between diff and patch, but I hope to be able to address that. And yes, in addition to some of the CamelCase issues, I expect another problem is that if a page is blanked, "Describe the new page here." will reappear outside of the diff_log.
I don't think that will be a problem. But there are other problems that I've encountered.
UseMod had a deletion feature. It turns out to be easy enough to skip deleted pages, since they don't have a corresponding entry in rclog.
It also had an admin-only rename feature, which optionally fixed links in all pages. This accounts for the free link changes I was seeing earlier. And it had a link replacement feature which could be invoked without a page move. These features were rarely used because of the arcane interface; usually people just moved pages by copying and pasting. But during the free-link conversion, a lot of pages were renamed using the admin-only feature.
All these admin-only features were unlogged, but it turns out to be possible to reconstruct page moves, because when a page was moved, its name was updated in rclog but not in diff_log. By finding the first diff_log entry with the new name, you can roughly work out when the page moves were done.
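The heuristic can be sketched in Python as follows. The data layout is an assumption made for the example: both logs are reduced to `(timestamp, page_name)` pairs, and an edit is matched between the two logs by its timestamp, which would break if two edits share the same second. The function name `estimate_page_moves` is hypothetical.

```python
def estimate_page_moves(rclog, diff_log):
    """Estimate when pages were renamed.

    rclog and diff_log are iterables of (timestamp, page_name) pairs.
    Returns {(old_name, new_name): approx_move_time}, where approx_move_time
    is the earliest diff_log timestamp using new_name, or None if the new
    name never appears in diff_log.
    """
    diff_entries = list(diff_log)

    # Name recorded in rclog for each edit (already updated by any rename).
    rc_name_at = {ts: name for ts, name in rclog}

    # Earliest time each page name shows up in diff_log.
    first_seen = {}
    for ts, name in diff_entries:
        if name not in first_seen or ts < first_seen[name]:
            first_seen[name] = ts

    moves = {}
    for ts, old_name in diff_entries:
        new_name = rc_name_at.get(ts)
        # Edits with no rclog entry (deleted pages) are skipped.
        if new_name is not None and new_name != old_name:
            moves[(old_name, new_name)] = first_seen.get(new_name)
    return moves
```

The estimate is only an upper bound on the move time: the rename happened some time between the last diff_log entry under the old name and the first one under the new name.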
Anyway, I'm developing a script which will import the dump into a modified MediaWiki instance, the idea being that I can then export XML from it. Once it works, I'll upload the XML to somewhere. I'm not sure when that will be.
-- Tim Starling
I have the first 10K edits up reconstructed in their various pages at: http://cyber.law.harvard.edu/~reagle/wp-redux/
This is amazing! Thanks for the work and effort, this reconstruction is a priceless resource for researchers. Lior
On Thu, Dec 16, 2010 at 8:53 PM, Joseph Reagle joseph.2008@reagle.org wrote:
I have the first 10K edits up reconstructed in their various pages at: http://cyber.law.harvard.edu/~reagle/wp-redux/
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
On Thursday, December 16, 2010, lior gimel wrote:
This is amazing!
And buggy! :-)
Thanks for the work and effort, this reconstruction is a priceless resource for researchers.
Thanks to Tim for providing the data, and for working on a much better version that I look forward to!