[WikiEN-l] [Wiki-research-l] Old Wikipedia backups discovered

Joseph Reagle joseph.2008 at reagle.org
Tue Dec 21 16:54:25 UTC 2010


On Tuesday, December 21, 2010, Tim Starling wrote:
> I don't think this is the right approach. The server would have sent a
> MIME type of text/html, which means that it's effectively CP1252.

Yes, you are right: sticking with CP1252 does seem better. I've just updated 10K, and far fewer diffs are dropped; the characters also display correctly in my Web browser (for example, in the "København" article Martin Pedersen noted).
  http://cyber.law.harvard.edu/~reagle/wp-redux/
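
For anyone following along, here is a minimal sketch of what treating the bytes as CP1252 means in practice. It assumes Python; the function name decode_legacy and the sample byte strings are hypothetical illustrations, not taken from the actual conversion script or the backup itself.

```python
def decode_legacy(raw: bytes) -> str:
    """Decode bytes as Windows-1252 (CP1252).

    Hypothetical helper for illustration only. A handful of byte values
    (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in Python's cp1252 codec,
    so they are replaced with U+FFFD instead of raising an error.
    """
    return raw.decode("cp1252", errors="replace")


# 0xF8 is "ø" in both Latin-1 and CP1252, so this case works either way.
city = decode_legacy(b"K\xf8benhavn")

# Bytes in the 0x80-0x9F range are where the two encodings differ:
# CP1252 maps 0x93/0x94 to curly quotes, Latin-1 maps them to C1 controls.
quoted_cp1252 = decode_legacy(b"\x93wiki\x94")
quoted_latin1 = b"\x93wiki\x94".decode("latin-1")

print(city)
print(repr(quoted_cp1252))
print(repr(quoted_latin1))
```

The errors="replace" choice is just one option; a stricter script might prefer to log the undefined bytes rather than silently substitute them.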

> I've uploaded my latest attempt at converting the backup to XML:
> 
> http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z

I had a brief look, and I hope it will be useful as an intermediate step for importing into a wiki. For now, though, I'm going to stick with the original diff_log file: I've spent so much time getting the line feed and encoding issues sussed out that I don't want to add another layer of XML issues just yet. :-)


