[WikiEN-l] [Wiki-research-l] Old Wikipedia backups discovered

Anthony wikimail at inbox.org
Wed Dec 22 01:13:25 UTC 2010


On Tue, Dec 21, 2010 at 7:51 PM, Tim Starling <tstarling at wikimedia.org> wrote:
> In XML 1.1:
>
> "Char       ::=       #x9 | #xA | #xD | [#x20-#xD7FF] |
> [#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character,
> excluding the surrogate blocks, FFFE, and FFFF. */"

Where are you reading that?  At http://www.w3.org/TR/xml11/#charsets I read:

[2]   Char           ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
      /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a]  RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
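
To make the difference between the two definitions concrete, here is a rough Python sketch that classifies a code point against the production Tim quoted (which, as far as I know, matches the XML 1.0 definition of Char) and against productions [2] and [2a] as I read them in XML 1.1. The names in_ranges and classify are made up for illustration; they are not part of MediaWiki or of any XML library.

```python
# Ranges copied from the productions quoted in this thread.
# XML10_CHAR is the production Tim quoted (assumed here to be XML 1.0's Char).
XML10_CHAR = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
# XML 1.1 production [2] Char.
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
# XML 1.1 production [2a] RestrictedChar.
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(cp, ranges):
    """Return True if code point cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def classify(cp):
    """Report which of the three productions a code point matches."""
    return {
        "xml10_char": in_ranges(cp, XML10_CHAR),
        "xml11_char": in_ranges(cp, XML11_CHAR),
        "xml11_restricted": in_ranges(cp, XML11_RESTRICTED),
    }
```

For example, classify(0x1) reports that #x1 is not a Char under the quoted 1.0-style production, but is both a Char and a RestrictedChar under XML 1.1, which is the case that matters for the old dumps.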

> Without this change, importDump.php gives a fatal error.

Have you tried escaping them?  Does importDump.php work with XML 1.1,
or only with XML 1.0?  Is the file declared as XML 1.1 or as XML 1.0?
If the file is declared as XML 1.1 (*), the control characters are
escaped as character references, and importDump.php still gives a
fatal error, then it sounds like a bug in importDump.php.

"Finally, there is considerable demand to define a standard
representation of arbitrary Unicode characters in XML documents.
Therefore, XML 1.1 allows the use of character references to the
control characters #x1 through #x1F, most of which are forbidden in
XML 1.0.  For reasons of robustness, however, these characters still
cannot be used directly in documents.  In order to improve the
robustness of character encoding detection, the additional control
characters #x7F through #x9F, which were freely allowed in XML 1.0
documents, now must also appear only as character references.
(Whitespace characters are of course exempt.)  The minor sacrifice of
backward compatibility is considered not significant.  Due to
potential problems with APIs, #x0 is still forbidden both directly and
as a character reference."
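
By "escaping" I mean something along these lines. This is only a sketch in Python, not what the dump scripts actually do: escape_xml11 is a hypothetical helper, and it assumes the output document is declared as XML 1.1. It writes every RestrictedChar as a numeric character reference and refuses #x0 and anything else outside Char, since those cannot be represented at all.

```python
from xml.sax.saxutils import escape  # escapes &, < and > only

# XML 1.1 production [2] Char and [2a] RestrictedChar, as quoted above.
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def escape_xml11(text):
    """Escape text for use as character data in an XML 1.1 document.

    Hypothetical helper: markup characters are escaped first, then each
    RestrictedChar is replaced by a numeric character reference.
    """
    out = []
    # Escape markup first so the '&' of our own references is not re-escaped.
    for ch in escape(text):
        cp = ord(ch)
        if not _in_ranges(cp, XML11_CHAR):
            # Covers #x0, lone surrogates, #xFFFE and #xFFFF.
            raise ValueError("U+%04X is not allowed in XML 1.1" % cp)
        if _in_ranges(cp, XML11_RESTRICTED):
            out.append("&#x%X;" % cp)
        else:
            out.append(ch)
    return "".join(out)
```

Tab, newline and carriage return pass through untouched because they are not in RestrictedChar. A parser that only understands XML 1.0 would presumably still reject a reference such as &#x1;, so this only helps if the reading side handles XML 1.1.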

(*) Ah, there's one problem.  It isn't XML 1.1:
http://www.mediawiki.org/xml/export-0.3.xsd starts with xml
version="1.0".
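
For what it's worth, checking what a given file declares is easy enough. A rough Python sketch, assuming an ASCII-compatible encoding such as UTF-8 and treating a file without an XML declaration as 1.0 (declared_xml_version is a made-up name, not an existing tool):

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the version number; the quote style (single or double) must be consistent.
_XML_DECL = re.compile(rb"""<\?xml\s+version\s*=\s*(["'])(1\.[0-9]+)\1""")

_UTF8_BOM = b"\xef\xbb\xbf"


def declared_xml_version(head):
    """Return the version declared in the first bytes of an XML file.

    head is a bytes object holding the beginning of the file. If there
    is no XML declaration, "1.0" is returned.
    """
    if head.startswith(_UTF8_BOM):
        head = head[len(_UTF8_BOM):]
    match = _XML_DECL.match(head)
    return match.group(2).decode("ascii") if match else "1.0"
```

Reading the first hundred bytes or so of the dump and passing them to this function would show whether the dump itself, as opposed to the schema, claims to be XML 1.1.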


