Hi,
When I tried to parse the current German XML dump I discovered the following malformed sequence (in [[de:India]]):
[[got:&#xD800;&#xDC00;...
It looks like someone tried to encode a Unicode surrogate pair with XML character references. Maybe MediaWiki does not recognize &#xD800; as a reference to an invalid Unicode character and transformed it into this form. I have not tried to submit invalid Unicode characters through the edit form to reproduce the error.
Anyway, the dump is broken. It's not well-formed XML (so it's not XML at all, just "looks-like-XML"), and every conformant XML parser will fail to parse it.
According to the XML specification (1.0), section 2.2, legal characters in XML are any Unicode characters, excluding the surrogate blocks, FFFE, and FFFF:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Using any of the following Unicode characters will make Special:Export and the XML dump fail:
#x0-#x8, #xB-#xC, #xE-#x1F, #xD800-#xDFFF, #xFFFE-#xFFFF, #x110000-...
Additionally, you can produce them with hexadecimal and decimal character references - I don't know how the wrong characters were encoded in the SQL database.
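If you want to scan a dump for such characters yourself, here is a minimal sketch in Perl, assuming the input is read as UTF-8 (the helper name is_xml_char is mine, not anything from MediaWiki):

#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :utf8);    # treat input and output as UTF-8

# The Char production from XML 1.0, section 2.2.
sub is_xml_char {
    my $cp = shift;
    return $cp == 0x9 || $cp == 0xA || $cp == 0xD
        || ($cp >= 0x20    && $cp <= 0xD7FF)
        || ($cp >= 0xE000  && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

# Report every literal character that is not legal in XML.
while (my $line = <>) {
    for my $ch (split //, $line) {
        printf "line %d: illegal character U+%04X\n", $., ord($ch)
            unless is_xml_char(ord($ch));
    }
}

Note that this only catches literal characters; numeric references like &#xD800; look like plain text to this script and would have to be checked separately.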
Greetings, Jakob
BTW: I doubt that anyone has ever tried to validate the huge XML dump as a whole - as far as I know validating XML streams (given an XML schema) is still a research topic. It's not the only part where MediaWiki touches the research border of current computer science :-)
Hi again,
I wrote:
When I tried to parse the current German XML dump I discovered the following malformed sequence (in [[de:India]]):
You can remove the errors with a little Perl script - only a workaround for the current dump:
#!/usr/bin/perl
# Drop the broken [[got:...]] line; the broken [[zh:...]] line apparently
# also carried the closing </text> tag, so emit just that tag instead.
while (<>) {
    next if /^\[\[got:&#xD800;/;
    if (/^\[\[zh:&#xD800;/) { print "</text>"; } else { print $_; }
}
Jakob Voss wrote:
Hi again,
I wrote:
When I tried to parse the current German XML dump I discovered the following malformed sequence (in [[de:India]]):
You can remove the errors with a little Perl script - only a workaround for the current dump:
For me this worked fine: replace every "&#" with "&amp;#" so the XML parser won't see the entity (first I used sed; now my program does the replacement before handing the stream to the parser). Of course the program using the data will have to take care of it.
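The same replacement as a one-line filter, just as a sketch (the file names here are made up):

perl -pe 's/&#/&amp;#/g' pages_current.xml > pages_current.escaped.xml

After that, the XML parser hands the application the literal text "&#xD800;" instead of trying to decode a character reference, and the consuming program can decide what to do with it.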
de:SirJective
Jakob Voss wrote:
When I tried to parse the current German XML dump I discovered the following malformed sequence (in [[de:India]]):
[[got:&#xD800;&#xDC00;...
It looks like someone tried to encode a Unicode surrogate pair with XML character references. Maybe MediaWiki does not recognize &#xD800; as a reference to an invalid Unicode character and transformed it into this form. I have not tried to submit invalid Unicode characters through the edit form to reproduce the error.
MediaWiki's UTF-8 validation should reject a literal U+D800, and a literal "&#xD800;" typed as text would of course be escaped to "&amp;#xD800;" in XML output.
Could be a bug in Mono's XmlWriter implementation. (The dumps from MediaWiki are filtered and split into multiple streams by a program I wrote in C# to produce full, current-only, and current-non-talk-non-userpage dumps from one run.) I'll take a look.
BTW: I doubt that anyone has ever tried to validate the huge XML dump as a whole - as far as I know validating XML streams (given an XML schema) is still a research topic. It's not the only part where MediaWiki touches the research border of current computer science :-)
I have done test validations of the XML dumps as a whole before, using Xerces. Here's the shell script wrapper I use:
#!/bin/sh
XERCES=/home/brion/src/xerces/xerces-2_6_2
java -classpath $XERCES/xercesImpl.jar:$XERCES/xercesSamples.jar \
    sax.Counter -n -v -s -f $@
A working file:

$ schema-check demo2.xml
demo2.xml: 31636 ms (17286 elems, 1736 attrs, 0 spaces, 433774 chars)
With &#xD800;&#xDC00; slipped in:

$ schema-check demox.xml
[Fatal Error] demox.xml:48:15: Character reference "&#xD800;" is an invalid XML character.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Could be a bug in Mono's XmlWriter implementation. (The dumps from MediaWiki are filtered and split into multiple streams by a program I wrote in C# to produce full, current-only, and current-non-talk-non-userpage dumps from one run.) I'll take a look.
Now filed as http://bugzilla.ximian.com/show_bug.cgi?id=76095
Will see about fixing...
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Could be a bug in Mono's XmlWriter implementation. (The dumps from MediaWiki are filtered and split into multiple streams by a program I wrote in C# to produce full, current-only, and current-non-talk-non-userpage dumps from one run.) I'll take a look.
Now filed as http://bugzilla.ximian.com/show_bug.cgi?id=76095
Will see about fixing...
Have submitted a patch. The next dump should be correct.
Have I mentioned how much I hate UTF-16 and how a 16-bit "char" type promotes the writing of naive code that doesn't take surrogate pairs into account?
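To illustrate the failure mode - in Perl rather than the actual C# code, and with a made-up Gothic letter, since the [[got:]] link presumably contained something from the Gothic block around U+10330:

#!/usr/bin/perl
use strict;
use warnings;

my $cp = 0x10348;    # GOTHIC LETTER HWAIR, for example

# Correct: one character reference for the code point itself.
printf "good: &#x%X;\n", $cp;               # good: &#x10348;

# Naive 16-bit code: split the code point into its UTF-16 surrogate
# halves and emit a reference for each half. Surrogates are not legal
# XML characters, so the result is exactly the garbage seen in the dump.
my $hi = 0xD800 + (($cp - 0x10000) >> 10);
my $lo = 0xDC00 + (($cp - 0x10000) & 0x3FF);
printf "bad:  &#x%X;&#x%X;\n", $hi, $lo;    # bad:  &#xD800;&#xDF48;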
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Now filed as http://bugzilla.ximian.com/show_bug.cgi?id=76095
Will see about fixing...
Have submitted a patch. The next dump should be correct.
Patch accepted into Mono subversion repository. These guys are fast. :)
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Brion Vibber wrote:
Now filed as http://bugzilla.ximian.com/show_bug.cgi?id=76095
Will see about fixing...
Have submitted a patch. The next dump should be correct.
Patch accepted into Mono subversion repository. These guys are fast. :)
-- brion vibber (brion @ pobox.com)
Added that to our bugzilla: http://bugzilla.wikimedia.org/show_bug.cgi?id=3473