Hi,
When I tried to parse the current German XML dump I discovered the following malformed sequence (in [[de:India]]):
[[got:��...
It looks like someone tried to encode a Unicode surrogate pair with XML character references. Maybe MediaWiki did not recognize &#xD800; as an invalid Unicode character and transformed it into this form. I have not tried to reproduce the error by sending invalid Unicode characters through an edit form.
Anyway, the dump is broken. It is not well-formed XML (so it is not XML at all, just "looks-like-XML"), and every conforming XML parser will fail to parse it.
According to the XML specification (1.0), section 2.2, legal characters in XML are any Unicode characters, excluding the surrogate blocks, FFFE, and FFFF:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
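As a rough illustration, the Char production above can be written as a range check. This is only a sketch: the function name is_xml_char is made up for this example and is not part of MediaWiki or of any XML library.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # everything below the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD        # above the surrogates, minus FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )
```

With this check, is_xml_char(0xD800) is False, so a reference to that code point can never appear in a well-formed document.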
Using any of the following Unicode characters will make SpecialExport and the XML dump fail:
#x0-#x8, #xB-#xC, #xE-#x1F, #xD800-#xDFFF, #xFFFE-#xFFFF, #x110000-...
Additionally, these characters can be written as either hexadecimal or decimal character references. I don't know how the wrong characters were encoded in the SQL database.
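Since the bad characters can show up as hexadecimal or decimal references, one possible way to locate them in a dump is to scan the raw text before handing it to an XML parser. The following is a minimal sketch, not a MediaWiki tool: the names CHAR_REF, is_xml_char and find_illegal_char_refs are invented for this example, and it only looks at numeric character references, not at raw illegal bytes.

```python
import re

# Matches decimal (&#55296;) and hexadecimal (&#xD800;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(cp):
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_illegal_char_refs(text):
    """Yield (offset, reference) for each reference to a non-Char code point."""
    for match in CHAR_REF.finditer(text):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            cp = int(hex_digits, 16)
        else:
            cp = int(dec_digits)
        if not is_xml_char(cp):
            yield match.start(), match.group(0)
```

For a real dump one would probably feed it line by line, e.g. looping over the file and calling find_illegal_char_refs(line) on each line, so the whole file never has to be held in memory.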
Greetings, Jakob
BTW: I doubt that anyone has ever tried to validate the huge XML dump as a whole - as far as I know validating XML streams (given an XML schema) is still a research topic. It's not the only part where MediaWiki touches the research border of current computer science :-)