On Tuesday, December 21, 2010, Tim Starling wrote:
I don't think this is the right approach. The server would have sent a
MIME type of text/html, which means that it's effectively CP1252.
Yes, you are right, sticking with CP1252 does seem better. I've just updated 10K and
far fewer diffs are dropped, and the characters display correctly in my Web browser (such
as in the "København" article Martin Pedersen noted).
http://cyber.law.harvard.edu/~reagle/wp-redux/
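
For anyone following along, here is a minimal sketch of why the codec choice matters, assuming Python 3. The helper name `decode_legacy` and the sample bytes are hypothetical illustrations, not taken from the actual conversion script. CP1252 and ISO-8859-1 agree on bytes like 0xF8 ("ø") but differ in the 0x80-0x9F range, where CP1252 has printable characters such as curly quotes.

```python
def decode_legacy(raw: bytes) -> str:
    """Decode bytes as CP1252 (hypothetical helper, not the real script).

    Python's cp1252 codec leaves a few bytes (e.g. 0x81) undefined, so
    errors="replace" maps those to U+FFFD instead of raising.
    """
    return raw.decode("cp1252", errors="replace")


# Made-up sample: "København" followed by a word in CP1252 curly quotes.
sample = b"K\xf8benhavn \x93quoted\x94"

as_cp1252 = decode_legacy(sample)
as_latin1 = sample.decode("latin-1")

print(as_cp1252)
# The two decodings differ only where bytes fall in 0x80-0x9F.
print(as_cp1252 == as_latin1)
```

Whether to replace or reject the undefined bytes is a judgment call; `errors="replace"` is only one option and may not be what you want for an archival conversion.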
I've uploaded my latest attempt at converting the backup to XML:
http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
I had a brief look, and I hope it will be useful as an intermediate step for
importing into a wiki, but I'm gonna stick with the original diff_log file for now.
I've spent so much time getting the line feeds and encoding issues sussed out that
I don't want to add another layer of XML issues just yet. :-)