[Foundation-l] Old Wikipedia backups discovered

Federico Leva (Nemo) nemowiki at gmail.com
Thu Dec 16 20:01:40 UTC 2010


Good news from Wiki-research-l in case you're not subscribed to it...

Nemo

-------- Original Message --------
Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered
Date: Thu, 16 Dec 2010 13:53:14 -0500
From: Joseph Reagle

I have the first 10K edits up, reconstructed into their respective pages, at:
   http://cyber.law.harvard.edu/~reagle/wp-redux/

-------- Original Message --------
Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered
Date: Fri, 17 Dec 2010 00:03:00 +1100
From: Tim Starling

On 16/12/10 23:10, Joseph Reagle wrote:
 > On Wednesday, December 15, 2010, Tim Starling wrote:
 >> There were some changes made to the page text that weren't represented
 >> in diff_log, specifically changing certain camel-case links to free
 >> links.
 > It appears my problems were related to some CR/LF issues not
 > round-tripping between diff and patch, but I hope to be able to address
 > that. And yes, in addition to some of the CamelCase issues, I expect
 > another problem is that if a page is blanked "Describe the new page
 > here." will reappear outside of the diff_log.

I don't think that will be a problem. But there are other problems
that I've encountered.

UseMod had a deletion feature. It turns out to be easy enough to skip
deleted pages, since they don't have a corresponding entry in rclog.
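
As a rough illustration of that filter, here is a minimal Python sketch. It
assumes, purely for illustration, that diff_log and rclog have already been
parsed into lists of (timestamp, page_name) tuples and that the timestamp
identifies an edit in both logs. The real file formats and the function name
skip_deleted are not taken from the dump.

```python
def skip_deleted(diff_entries, rclog_entries):
    """Keep only diff_log entries that have a corresponding rclog entry.

    Both arguments are lists of (timestamp, page_name) tuples. Matching on
    the timestamp is an assumption made for this sketch.
    """
    logged_edits = {ts for ts, _ in rclog_entries}
    return [(ts, page) for ts, page in diff_entries if ts in logged_edits]


# Hypothetical sample data: the edit at timestamp 2 belongs to a deleted page.
sample_diff = [(1, "HomePage"), (2, "SpamPage"), (3, "HomePage")]
sample_rclog = [(1, "HomePage"), (3, "HomePage")]
kept = skip_deleted(sample_diff, sample_rclog)
```

Matching on the edit rather than on the page name avoids dropping edits
whose page name differs between the two logs.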

It also had an admin-only rename feature, which optionally fixed links
in all pages. This accounts for the free link changes I was seeing
earlier. And it had a link replacement feature which could be invoked
without a page move. These features were rarely used because of the
arcane interface; usually people just moved pages by copying and
pasting. But during the free-link conversion, a lot of pages were
renamed using the admin-only feature.

All these admin-only features were unlogged, but it turns out to be
possible to reconstruct page moves, because when a page was moved, its
name was updated in rclog but not in diff_log. By finding the first
diff_log entry with the new name, you can roughly work out when the
page moves were done.
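
The idea can be sketched in Python as follows. This is only an illustration
under stated assumptions: both logs are taken to be lists of
(timestamp, page_name) tuples that can be joined on the timestamp, and the
function name reconstruct_moves and the sample data are hypothetical, not
part of the dump or of the import script.

```python
def reconstruct_moves(diff_entries, rclog_entries):
    """Return {old_name: (new_name, approx_move_time)}.

    An edit whose name in rclog differs from its name in diff_log is
    treated as belonging to a moved page. The move time is approximated
    by the first diff_log entry recorded under the new name, or None if
    the new name never appears in diff_log.
    """
    rclog_name = dict(rclog_entries)  # timestamp -> current (new) name

    # Earliest timestamp at which each name shows up in diff_log.
    first_seen = {}
    for ts, page in sorted(diff_entries):
        first_seen.setdefault(page, ts)

    moves = {}
    for ts, old_name in sorted(diff_entries):
        new_name = rclog_name.get(ts)
        if new_name is None or new_name == old_name:
            continue  # no rclog entry, or the page was never renamed
        moves[old_name] = (new_name, first_seen.get(new_name))
    return moves


# Hypothetical sample: "WikiPedia" was renamed to "Wikipedia" before edit 5.
sample_diff = [(1, "WikiPedia"), (2, "WikiPedia"), (5, "Wikipedia"), (6, "OtherPage")]
sample_rclog = [(1, "Wikipedia"), (2, "Wikipedia"), (5, "Wikipedia"), (6, "OtherPage")]
moves = reconstruct_moves(sample_diff, sample_rclog)
```

The result is only approximate: the move happened some time between the last
edit under the old name and the first edit under the new one.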

Anyway, I'm developing a script which will import the dump into a
modified MediaWiki instance, the idea being that I can then export XML
from it. Once it works, I'll upload the XML somewhere. I'm not sure
when that will be.

-- Tim Starling