On Thursday, December 16, 2010, Federico Leva (Nemo) wrote:
I have the first 10K edits up reconstructed in their various pages at: http://cyber.law.harvard.edu/~reagle/wp-redux/
I fixed some of the encoding issues. The DB dump contained different encodings, so the encoding of each diff in the dump is now guessed independently using Python's chardet (Universal Encoding Detector) library.
So now you can read up on the few "accented" topics in the early Wikipedia, including Göteborg, Köpenhamn, and Křbenhavn. (Nothing very exciting.) But it means articles such as ASCII are much improved as well. Interestingly, the ASCII page isn't so much about ASCII itself as about how to type non-ASCII characters in the early Wikipedia.
http://cyber.law.harvard.edu/~reagle/wp-redux/ASCII/983670583.html
On 18/12/10 07:18, Joseph Reagle wrote:
> On Thursday, December 16, 2010, Federico Leva (Nemo) wrote:
> I have the first 10K edits up reconstructed in their various pages at: http://cyber.law.harvard.edu/~reagle/wp-redux/
> I fixed some of the encoding issues. The DB dump contained different encodings, so the encoding of each diff in the dump is now guessed independently using Python's chardet (Universal Encoding Detector) library.
I don't think this is the right approach. The server would have sent a MIME type of text/html, which means that it's effectively CP1252. Perhaps some broken browsers also submitted edits with some other character encoding. But everyone would have seen mojibake at the time, and so mojibake is what we should show in the archive. Any such issues would have been fixed by subsequent edits, and those edits won't make sense if you "correct" the encoding in the original edit.
I've uploaded my latest attempt at converting the backup to XML:
http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
The archive contains an invalid XML file, with control characters preserved, and a valid XML file, with control characters filtered. There's also a log file detailing the outstanding issues. The script is in Subversion.
-- Tim Starling
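
The difference between keeping CP1252 and trusting a guessed encoding can be seen in a minimal Python sketch. The byte string below is an assumed example (a single-byte encoding of "København"), not data taken from the dump, and ISO-8859-2 merely stands in for one possible wrong guess by a detector.

```python
# The same raw bytes, decoded under different assumed encodings.
raw = b"K\xf8benhavn"  # hypothetical stored bytes for an article title

as_cp1252 = raw.decode("cp1252")      # reading the bytes as CP1252
as_latin2 = raw.decode("iso-8859-2")  # reading them under a wrongly guessed encoding

print(as_cp1252)
print(as_latin2)

# UTF-8 bytes read as CP1252 produce the familiar two-character mojibake.
utf8_as_cp1252 = "København".encode("utf-8").decode("cp1252")
print(utf8_as_cp1252)
```

Decoding every diff with one fixed encoding keeps whatever mojibake readers saw at the time, while a per-diff guess can silently change it.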
On Tuesday, December 21, 2010, Tim Starling wrote:
> I don't think this is the right approach. The server would have sent a MIME type of text/html, which means that it's effectively CP1252.
Yes, you are right, sticking with CP1252 does seem better. I've just updated the 10K: many fewer diffs are dropped, and the characters display correctly in my Web browser (such as in the "København" article Martin Pedersen noted). http://cyber.law.harvard.edu/~reagle/wp-redux/
> I've uploaded my latest attempt at converting the backup to XML:
> http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
I had a brief look, and I hope it will be useful as an intermediate step for importing into a wiki. But I'm gonna stick with the original diff_log file for now: I've wasted so much time getting the line feeds and encoding issues sussed out that I don't want to add another layer of XML issues just yet. :-)
On Tue, Dec 21, 2010 at 10:02 AM, Tim Starling <tstarling@wikimedia.org> wrote:
> I've uploaded my latest attempt at converting the backup to XML:
> http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
> The archive contains an invalid XML file, with control characters preserved, and a valid XML file, with control characters filtered.
Which control characters? Aren't control characters allowed in XML 1.1?
On 22/12/10 11:43, Anthony wrote:
> On Tue, Dec 21, 2010 at 10:02 AM, Tim Starling <tstarling@wikimedia.org> wrote:
> > I've uploaded my latest attempt at converting the backup to XML:
> > http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
> > The archive contains an invalid XML file, with control characters preserved, and a valid XML file, with control characters filtered.
> Which control characters? Aren't control characters allowed in XML 1.1?
I did: tr -d '\000-\010\013-\037'
In XML 1.1:
"Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.
"Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */"
Without this change, importDump.php gives a fatal error.
-- Tim Starling
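
For readers unfamiliar with octal ranges, the `tr -d '\000-\010\013-\037'` filter above can be approximated in Python as follows. This is a sketch of the same byte-level deletion, not the script from Subversion; the names `DELETED` and `strip_control_bytes` and the sample input are made up for illustration.

```python
# Bytes deleted by: tr -d '\000-\010\013-\037'
# Octal 000-010 is 0x00-0x08; octal 013-037 is 0x0B-0x1F.
DELETED = bytes(range(0x00, 0x09)) + bytes(range(0x0B, 0x20))


def strip_control_bytes(data: bytes) -> bytes:
    """Remove the deleted control bytes, keeping tab (0x09) and line feed (0x0A)."""
    return data.translate(None, DELETED)


sample = b"tab\there\x01, newline\n, carriage return\r, bell\x07"
print(strip_control_bytes(sample))
```

Note that the range 013-037 includes octal 015 (carriage return, 0x0D), so this filter also removes carriage returns even though the quoted Char production permits #xD.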
On Tue, Dec 21, 2010 at 7:51 PM, Tim Starling <tstarling@wikimedia.org> wrote:
> In XML 1.1:
> "Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */"
Where are you reading that? At http://www.w3.org/TR/xml11/#charsets I read:
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
> Without this change, importDump.php gives a fatal error.
Have you tried escaping them? Does importDump.php work with XML 1.1, or only XML 1.0? Is the file defined as XML 1.1 or XML 1.0? If the file is designated as XML 1.1 (*), the control characters are escaped, and importDump.php still gives a fatal error, it sounds like a bug in importDump.php.
"Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference."
(*) Ah, there's one problem. It isn't. http://www.mediawiki.org/xml/export-0.3.xsd starts with xml version="1.0".
On Tue, Dec 21, 2010 at 8:13 PM, Anthony <wikimail@inbox.org> wrote:
> Have you tried escaping them?
By which I mean, using character references.
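
A minimal Python sketch of that idea, assuming the text has already had its markup characters (`&`, `<`) escaped and that the output file would be declared as XML 1.1. The character class follows the RestrictedChar production quoted above; the function name `escape_restricted` is hypothetical, and whether importDump.php accepts such a file is exactly the open question in this thread.

```python
import re

# RestrictedChar from the XML 1.1 production quoted in the thread:
# [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
RESTRICTED = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")


def escape_restricted(text: str) -> str:
    """Replace each restricted character with a numeric character reference."""
    if "\x00" in text:
        # Per the quoted spec text, #x0 is forbidden even as a character reference.
        raise ValueError("U+0000 cannot be represented in XML 1.1")
    return RESTRICTED.sub(lambda m: "&#x%X;" % ord(m.group()), text)


print(escape_restricted("bell\x07 and tab\t stay readable"))
```

Tab, line feed, carriage return, and #x85 are left untouched because they fall outside RestrictedChar.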
wiki-research-l@lists.wikimedia.org