Jakob Voss wrote:
When I tried to parse the current German XML dump I discovered the
following malformed sequence (in [[de:India]]):
[[got:��...
It looks like someone tried to encode a Unicode surrogate pair with
XML character references. Maybe MediaWiki does not recognize &#xD800;
as an invalid Unicode character and transformed it into this form.
I have not tried to send invalid unicode characters in an edit form
to reproduce the error.
MediaWiki's UTF-8 validation should reject a literal U+D800, and a
literal &#xD800; would of course be transformed into &amp;#xD800; in XML
output.
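To illustrate the two cases above, here is a minimal Python sketch. It is not MediaWiki's own code; it only shows that a strict UTF-8 decoder refuses the bytes that would encode a lone U+D800, and that escaping the ampersand turns a typed-in character reference into harmless text. The helper name is_valid_utf8 and the sample strings are assumptions made for the example.

```python
from xml.sax.saxutils import escape

# Bytes that would encode the lone surrogate U+D800 if UTF-8 allowed it.
raw = b"\xed\xa0\x80"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (surrogates are rejected)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

surrogate_rejected = not is_valid_utf8(raw)

# A character reference typed into page text is plain ASCII; escaping the
# ampersand keeps it from being read as a reference in the XML output.
escaped = escape("[[got:&#xD800;")

print(surrogate_rejected, escaped)
```

If both steps behave this way in the real pipeline, an unescaped surrogate reference in the dump would have to come from a later stage.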
Could be a bug in Mono's XmlWriter implementation. (The dumps from
MediaWiki are filtered and split into multiple streams by a program I
wrote in C# to produce full, current-only, and current-non-talk-
non-userpage dumps from one run.) I'll take a look.
BTW: I doubt that anyone has ever tried to validate the huge XML dump as
a whole - as far as I know validating XML streams (given an XML schema)
is still a research topic. It's not the only part where MediaWiki
touches the research border of current computer science :-)
I have done test validations of the XML dumps as a whole before, using
Xerces. Here's the shell script wrapper I use:
#!/bin/sh
# Validate the given XML files against their schema with the Xerces sax.Counter sample.
XERCES=/home/brion/src/xerces/xerces-2_6_2
java -classpath "$XERCES/xercesImpl.jar:$XERCES/xercesSamples.jar" \
  sax.Counter -n -v -s -f "$@"
A working file:
$ schema-check demo2.xml
demo2.xml: 31636 ms (17286 elems, 1736 attrs, 0 spaces, 433774 chars)
With �� slipped in:
$ schema-check demox.xml
[Fatal Error] demox.xml:48:15: Character reference "&#xD800" is an
invalid XML character.
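For a quick check that does not need a full schema validation, the same class of error can be found by scanning for numeric character references outside the XML 1.0 Char ranges. The following is only a sketch: the function names are made up for the example, the second reference in the sample (&#xDF39;) is an illustrative low surrogate and not the value from the actual dump, and the reported columns are simple 1-based offsets that need not match the positions Xerces prints.

```python
import re

# Matches decimal (&#55296;) and hexadecimal (&#xD800;) character references.
CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def is_xml_char(cp: int) -> bool:
    """XML 1.0 Char production: tab, LF, CR, and the non-surrogate ranges."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def find_bad_refs(lines):
    """Yield (line_number, column, reference) for each invalid reference."""
    for lineno, line in enumerate(lines, start=1):
        for m in CHAR_REF.finditer(line):
            hex_digits, dec_digits = m.groups()
            cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
            if not is_xml_char(cp):
                yield lineno, m.start() + 1, m.group(0)

# Illustrative input: line 1 holds a valid supplementary-plane reference,
# line 2 holds a surrogate pair written as two separate references.
sample = [
    "<text>ok &#x10339; fine</text>",
    "<text>[[got:&#xD800;&#xDF39;</text>",
]
bad = list(find_bad_refs(sample))
for lineno, col, ref in bad:
    print(f"{lineno}:{col}: invalid character reference {ref}")
```

An open file object can be passed in place of the sample list, since it also yields one line at a time.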
-- brion vibber (brion @ pobox.com)