[WikiEN-l] Old Wikipedia backups discovered

Tim Starling tstarling at wikimedia.org
Tue Dec 21 15:02:08 UTC 2010


On 18/12/10 07:18, Joseph Reagle wrote:
> On Thursday, December 16, 2010, Federico Leva (Nemo) wrote:
>> I have the first 10K edits up reconstructed in their various pages at:
>>    http://cyber.law.harvard.edu/~reagle/wp-redux/
> I fixed some of the encoding issues. The DB dump contained different encodings. So, the encoding of each diff in the dump is now guessed independently using Python's CharDet (Universal Encoding Detector) library.

I don't think this is the right approach. The server would have sent a
MIME type of text/html, presumably with no charset parameter, which
means that browsers effectively treated the text as CP1252.
Perhaps some broken browsers also submitted edits with some other
character encoding. But everyone would have seen mojibake at the time,
and so mojibake is what we should show in the archive. Any such issues
would have been fixed by subsequent edits, and those edits won't make
sense if you "correct" the encoding in the original edit.
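
To illustrate the point, here is a minimal Python sketch of decoding
every revision as CP1252 instead of guessing per diff. It is not the
conversion script mentioned below; the function name decode_as_served
is hypothetical, and mapping the five bytes that Python's cp1252 codec
leaves undefined to the matching C1 control characters is an
assumption about what browsers of the time did.

```python
def decode_as_served(raw: bytes) -> str:
    """Decode bytes as CP1252, keeping any mojibake readers saw at the time.

    Hypothetical helper for illustration only.
    """
    chars = []
    for byte in raw:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
            # undefined. Assumption: map them to the C1 control character
            # with the same code point rather than failing.
            chars.append(chr(byte))
    return "".join(chars)


# An edit submitted as UTF-8 by a misbehaving browser shows up as mojibake.
utf8_edit = "caf\u00e9".encode("utf-8")
print(decode_as_served(utf8_edit))
```

A later edit that repaired the two-character mojibake back to a single
accented letter only makes sense as a diff if the earlier revision is
stored this way.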

I've uploaded my latest attempt at converting the backup to XML:

http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z

The archive contains an invalid XML file, with control characters
preserved, and a valid XML file, with control characters filtered.
There's also a log file detailing the outstanding issues. The script
is in Subversion.
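
For readers who want to produce a valid file from their own copy, the
filtering step could look roughly like the sketch below. This is an
assumption about what "filtered" means, not the logic of the script in
Subversion: it drops every character outside the XML 1.0 Char
production, and the name strip_invalid_xml_chars is hypothetical.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that may not appear in an XML 1.0 document.

    Hypothetical helper for illustration only.
    """
    return _INVALID_XML_CHARS.sub("", text)


# Tab and newline survive; NUL and backspace are dropped.
print(repr(strip_invalid_xml_chars("a\x00b\x08c\td\n")))
```

Note that the C1 range (0x80 to 0x9F) is legal in XML 1.0, so a filter
like this one leaves those characters in place.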

-- Tim Starling


