[Wikitech-l] XML dump not well-formed because of unicode

13 Sep 2005

Hi,

When I tried to parse the current German XML dump I discovered the
following malformed sequence (in [[de:India]]):

[[got:&#xD800;&#xDF39;...

It looks like someone tried to encode a Unicode surrogate pair with
XML character references. Maybe MediaWiki does not recognize &#xD800;
as an invalid Unicode character and transformed it into this form.
I have not tried to send invalid unicode characters in an edit form
to reproduce the error.
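
For reference, the two references above look like the halves of a UTF-16
surrogate pair, and the arithmetic that joins such a pair into a single
code point can be sketched in Python. This is only an illustration: the
function name combine_surrogates is made up here and is not part of
MediaWiki.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#x}")
    # Each half carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF39)
print(f"&#x{code_point:X};")  # prints &#x10339;
```

So a single reference to U+10339, which lies in the Gothic block, would
presumably have been the well-formed way to write this character.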

Anyway, the dump is broken. It's not well-formed XML (so it's not XML
at all, just "looks-like-XML"), and every conforming XML parser will
fail to parse it.
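
A quick way to see this is to feed the sequence to a parser. The sketch
below uses Python's standard xml.etree.ElementTree (backed by expat); the
helper name is_well_formed and the one-element test documents are made up
for illustration and are not taken from the dump.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The sequence from the dump, wrapped in a minimal document.
broken = "<text>[[got:&#xD800;&#xDF39;]]</text>"
# The same link with one reference to the combined code point.
fixed = "<text>[[got:&#x10339;]]</text>"

print(is_well_formed(broken), is_well_formed(fixed))
```

The first document should be rejected because of the surrogate
references, while the second should parse.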

According to the XML specification (1.0), section 2.2, legal characters
in XML are any Unicode characters, excluding the surrogate blocks,
FFFE, and FFFF.

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]
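
The production translates almost directly into a range check. A minimal
Python sketch follows; the name is_xml_char is hypothetical and only
mirrors the Char production above.

```python
def is_xml_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


for cp in (0x41, 0xD800, 0xDF39, 0xFFFE, 0x10339):
    print(f"#x{cp:X}", is_xml_char(cp))
```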

Using any of the following Unicode characters will make SpecialExport
and the XML dump fail:

#x0-#x8, #xB-#xC, #xE-#x1F, #xD800-#xDFFF, #xFFFE-#xFFFF, #x110000-...

Additionally, these characters can be written as either hexadecimal
or decimal character references. I don't know how the invalid
characters were encoded in the SQL database.
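
Since both forms have to be caught, a scan for offending references could
look roughly like this. It is a sketch under assumptions: the names
CHAR_REF, is_xml_char and find_invalid_char_refs are made up, and the
regular expression looks at raw text only, so it does not know about
CDATA sections or comments.

```python
import re

# Matches &#xHEX; and &#DEC; numeric character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def find_invalid_char_refs(text: str) -> list[tuple[int, str]]:
    """Return (offset, reference) pairs for references to illegal characters."""
    bad = []
    for match in CHAR_REF.finditer(text):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            code_point = int(hex_digits, 16)
        else:
            code_point = int(dec_digits)
        if not is_xml_char(code_point):
            bad.append((match.start(), match.group(0)))
    return bad


print(find_invalid_char_refs("[[got:&#xD800;&#xDF39;"))
# prints [(6, '&#xD800;'), (14, '&#xDF39;')]
```

The decimal form &#55296; is the same code point as &#xD800; and would be
reported as well.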

Greetings,
Jakob

BTW: I doubt that anyone has ever tried to validate the huge XML dump as 
a whole - as far as I know validating XML streams (given an XML schema) 
is still a research topic. It's not the only part where MediaWiki 
touches the research border of current computer science :-)
