Re: [Wikitech-l] XML dump not well-formed because of unicode

13 Sep 2005


      Jakob Voss wrote:
...
When I tried to parse the current German XML dump I discovered the
following malformed sequence (in [[de:India]]):
[[got:&#xD800;&#xDF39;...
It looks like someone tried to encode a unicode surrogate pair with
XML character references. Maybe MediaWiki does not recognize #xD800;
as an invalid unicode character and transformed it into this form.
I have not tried to send invalid unicode characters in an edit form
to reproduce the error.
MediaWiki's UTF-8 validation should reject a literal U+D800, and a
literal &#xD800; would of course be transformed into &amp;#xD800; in XML
output.
Could be a bug in Mono's XmlWriter implementation. (The dumps from
MediaWiki are filtered and split into multiple streams by a program I
wrote in C# to produce full, current-only, and current-non-talk-
non-userpage dumps from one run.) I'll take a look.
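
For reference, the two references in the malformed sequence look like the UTF-16 surrogate halves of a single supplementary-plane character. A minimal Python sketch of the standard UTF-16 arithmetic follows; the function name combine_surrogates is made up for illustration and is not part of MediaWiki or the dump filter:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#x}")
    # Each half carries 10 bits; supplementary planes start at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the dump, written as one well-formed character reference.
cp = combine_surrogates(0xD800, 0xDF39)
print(f"&#x{cp:X};")  # prints &#x10339;
```

U+10339 falls in the Gothic block, which fits the [[got:...]] interwiki link, so a well-formed dump would presumably carry either that single reference or the character itself encoded as UTF-8.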
...
BTW: I doubt that anyone has ever tried to validate the huge XML dump as
a whole - as far as I know validating XML streams (given an XML schema)
is still a research topic. It's not the only part where MediaWiki
touches the research border of current computer science :-)
I have done test validations of the XML dumps as a whole before, using
Xerces. Here's the shell script wrapper I use:
#!/bin/sh
XERCES=/home/brion/src/xerces/xerces-2_6_2
java -classpath "$XERCES/xercesImpl.jar:$XERCES/xercesSamples.jar" \
  sax.Counter -n -v -s -f "$@"
A working file:
$ schema-check demo2.xml
demo2.xml: 31636 ms (17286 elems, 1736 attrs, 0 spaces, 433774 chars)
With &#xD800;&#xDF39; slipped in:
$ schema-check demox.xml
[Fatal Error] demox.xml:48:15: Character reference "&#xD800" is an
invalid XML character.
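
The fatal error follows from the XML 1.0 Char production, which excludes the surrogate range U+D800 to U+DFFF. As a rough pre-check that does not need a full parser, something like the Python sketch below could flag such references. It is only an illustration under stated assumptions: the names CHAR_REF, is_xml_char and find_bad_refs are hypothetical, and the scan looks at raw text, so it would also report matches inside CDATA sections or comments.

```python
import re

# Matches hexadecimal (&#x...;) and decimal (&#...;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(cp: int) -> bool:
    """True if cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_bad_refs(text: str) -> list:
    """Return (offset, reference) pairs for references to non-XML characters."""
    bad = []
    for m in CHAR_REF.finditer(text):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if not is_xml_char(cp):
            bad.append((m.start(), m.group(0)))
    return bad


# The sequence from the dump is flagged; the combined reference is not.
print(find_bad_refs("[[got:&#xD800;&#xDF39;]] and &#x10339;"))
```

For a multi-gigabyte dump the same function could be applied line by line, since a character reference cannot span a line break.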
-- brion vibber (brion @ pobox.com)
