On Tue, Dec 21, 2010 at 7:51 PM, Tim Starling tstarling@wikimedia.org wrote:
> In XML 1.1:
> "Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */"
Where are you reading that? At http://www.w3.org/TR/xml11/#charsets I read:
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
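To make the difference concrete, here is a rough Python sketch of the two Char productions plus RestrictedChar. The function names are my own, and the first function encodes the production quoted at the top of this thread, which as far as I can tell is the XML 1.0 definition rather than the 1.1 one.

```python
def is_char_xml10(cp: int) -> bool:
    """Char as quoted at the top of the thread (appears to be XML 1.0)."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_char_xml11(cp: int) -> bool:
    """Production [2] Char from http://www.w3.org/TR/xml11/#charsets."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_restricted_xml11(cp: int) -> bool:
    """Production [2a] RestrictedChar: a Char in XML 1.1, but only
    usable as a character reference, never directly."""
    return (0x1 <= cp <= 0x8
            or 0xB <= cp <= 0xC
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F)


# Print a small comparison table for a few interesting code points.
for cp in (0x0, 0x1, 0x9, 0x85, 0x9F, 0xFFFE):
    print(f"U+{cp:04X}", is_char_xml10(cp), is_char_xml11(cp),
          is_restricted_xml11(cp))
```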
> Without this change, importDump.php gives a fatal error.
Have you tried escaping them? Does importDump.php work with XML 1.1, or only XML 1.0? Is the file defined as XML 1.1 or XML 1.0? If the file is designated as XML 1.1 (*), the control characters are escaped, and importDump.php still gives a fatal error, it sounds like a bug in importDump.php.
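By "escaping" I mean writing the control characters as numeric character references, which XML 1.1 permits for everything in RestrictedChar. A minimal Python sketch of that step follows; the function name is hypothetical, it has nothing to do with how the dump code actually works, and it deliberately leaves the usual `&` and `<` escaping to whatever serializer is in use.

```python
import re

# XML 1.1 RestrictedChar: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
_RESTRICTED = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")


def escape_restricted(text: str) -> str:
    """Replace XML 1.1 RestrictedChar characters with &#xN; references.

    Assumes `text` is character data that has already had & and <
    escaped (or will have them escaped separately).
    """
    if "\x00" in text:
        # U+0000 cannot appear in XML 1.1 at all, not even as a reference.
        raise ValueError("U+0000 is not allowed in XML 1.1")
    return _RESTRICTED.sub(lambda m: "&#x%X;" % ord(m.group()), text)
```

Tab, line feed, carriage return and U+0085 are not in RestrictedChar, so they pass through unchanged.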
"Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference."
(*) Ah, there's one problem. It isn't. http://www.mediawiki.org/xml/export-0.3.xsd starts with <?xml version="1.0".
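If anyone wants to check what a particular file declares, something along these lines should do. It is only a sketch: the function name is made up, it assumes an ASCII-compatible encoding such as UTF-8, and it treats a missing XML declaration as version 1.0.

```python
import re

# XML declaration start: '<?xml' S 'version' Eq (single- or double-quoted VersionNum)
_DECL = re.compile(rb'<\?xml\s+version\s*=\s*(["\'])(1\.[0-9]+)\1')


def declared_xml_version(head: bytes) -> str:
    """Return the version from the XML declaration at the start of `head`.

    `head` is the first few hundred bytes of the file. Falls back to
    "1.0" when there is no declaration.
    """
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        head = head[3:]
    m = _DECL.match(head)
    return m.group(2).decode("ascii") if m else "1.0"
```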