In XML, named entity references like &nbsp; and &bull; (with the special exceptions of &lt; &gt; &amp; &quot; &apos;) can be treated as well-formedness errors across the board by conformant XML processors. (Yes, this means that *any* XML document that uses *any* named entity reference except the special five is not well-formed, if you ask these XML processors.) Alternatively, if a DTD is provided, conformant XML processors can retrieve the DTD, parse it, and treat the reference as a well-formedness error if it doesn't occur in the DTD, otherwise parse it as you'd expect. (Yes, processors can really pick whichever behavior they want, as far as I understand it. As we all know, the great thing about standards is how many there are to choose from.)
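(To make that concrete, here's roughly what a typical non-validating parser does with an undefined named reference -- just an illustrative Python sketch using the stdlib, not anything MediaWiki actually runs:)

    import xml.etree.ElementTree as ET

    # expat (non-validating) never fetches a DTD, so any named reference
    # beyond the predefined five is reported as a well-formedness error.
    try:
        ET.fromstring('<p>foo&nbsp;bar</p>')
    except ET.ParseError as e:
        print('not well-formed:', e)        # "undefined entity &nbsp;..."

    # Numeric references and the five predefined names are always fine.
    ET.fromstring('<p>foo&#160;bar &amp; &lt;baz&gt;</p>')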
In practice, as far as I can tell, the XML UAs that our users use do the latter, retrieving the DTD. (Otherwise they'd instantly break, and our users wouldn't use them!) Thus we get away with using &nbsp; and such, and still work in these UAs. But this means we have to provide a doctype with a DTD, which means not just <!DOCTYPE html>. This is the default behavior on trunk -- we output an XHTML Strict DTD even though the document is actually HTML5. This has a few disadvantages, in addition to just being odd:
1) Validators treat the content as XHTML Strict, not HTML5, so it fails validation unless you specifically ask for HTML5 validation. I've already seen a couple of complaints about this, and we haven't even released yet. Lots of people care about validation.
2) XML processors are still within their rights to reject the page, declining to process the DTD and treating the page as non-well-formed.
3) For XML processors that do process the DTD, we force them to do a network load as soon as they start parsing the page. Presumably this slows down parsing (dunno how much in practice), and it also hurts the W3C's poor servers: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
The alternative is to simply not use any named character references -- replace them all by numeric ones. E.g., use &#160; instead of &nbsp;, and &#8226; instead of &bull;. Then we can use <!DOCTYPE html> by default and avoid these problems. In fact, we already do this for anything that passes through the parser, as far as I can tell -- we convert it to UTF-8.
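(Roughly the kind of transformation I mean -- a hedged Python sketch using the stdlib's HTML5 entity table; the real change would of course live in our PHP output code, and numericize() is just a made-up name for illustration:)

    import re
    from html.entities import html5   # maps e.g. 'nbsp;' -> '\xa0'

    XML_SAFE = {'amp;', 'lt;', 'gt;', 'quot;', 'apos;'}

    def numericize(markup):
        """Replace named character references (other than the XML five)
        with numeric ones, so no DTD is needed for well-formedness."""
        def repl(m):
            name = m.group(1) + ';'
            if name in XML_SAFE or name not in html5:
                return m.group(0)
            return ''.join('&#%d;' % ord(c) for c in html5[name])
        return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', repl, markup)

    print(numericize('a&nbsp;b &bull; &amp; &lt;'))
    # a&#160;b &#8226; &amp; &lt;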
The problem is that if we do this and then miss a few entities somewhere in the source code, some pages will mysteriously become non-well-formed and tools will break. Plus, of course, you have the usual risks of breakage from mass changes. Overall, though, I'd prefer that we do this, because the alternative is that I'd have to pester the standards people and validator people for a means to let us validate properly with an XHTML Strict doctype.
Are there any objections to me removing all named entity references from MediaWiki output?
* Aryeh Gregor Simetrical+wikilist@gmail.com [Sun, 25 Apr 2010 20:46:14 -0400]:
> In XML, named entity references like &nbsp; and &bull; (with the special exceptions of &lt; &gt; &amp; &quot; &apos;) can be treated as well-formedness errors across the board by conformant XML processors. (Yes, this means that *any* XML document that uses *any* named entity reference except the special five is not well-formed, if you ask these XML processors.) Alternatively, if a DTD is provided, conformant XML processors can retrieve the DTD, parse it, and treat the reference as a well-formedness error if it doesn't occur in the DTD, otherwise parse it as you'd expect. (Yes, processors can really pick whichever behavior they want, as far as I understand it. As we all know, the great thing about standards is how many there are to choose from.)
Wouldn't it be enough just to define an entity? http://www.criticism.com/dita/dtd2.html#section-ENTITIES I used such a definition for nbsp once in an XSL sheet. Don't know how well it works alone in XML. Dmitriy
On Mon, Apr 26, 2010 at 4:02 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
> Wouldn't it be enough just to define an entity? http://www.criticism.com/dita/dtd2.html#section-ENTITIES I used such a definition for nbsp once in an XSL sheet. Don't know how well it works alone in XML.
I guess that would be possible, yes, but HTML defines an awful lot of entities, and adding them all inline to every page doesn't sound like a great idea to me.
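(For what it's worth, Dmitriy's approach does work at the well-formedness level: an entity declared in the internal DTD subset needs no external fetch. A quick Python sketch to illustrate, not a recommendation:)

    import xml.etree.ElementTree as ET

    # Declaring the entities in the internal subset makes the references
    # legal without any external DTD -- but covering every named reference
    # HTML defines would mean shipping about two thousand such lines on
    # every single page.
    doc = '''<!DOCTYPE html [
      <!ENTITY nbsp "&#160;">
      <!ENTITY bull "&#8226;">
    ]>
    <html><body><p>a&nbsp;b &bull; c</p></body></html>'''
    print(repr(ET.fromstring(doc).find('.//p').text))   # 'a\xa0b • c'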
Aryeh Gregor wrote:
> On Mon, Apr 26, 2010 at 4:02 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
>> Wouldn't it be enough just to define an entity? http://www.criticism.com/dita/dtd2.html#section-ENTITIES I used such a definition for nbsp once in an XSL sheet. Don't know how well it works alone in XML.
> I guess that would be possible, yes, but HTML defines an awful lot of entities, and adding them all inline to every page doesn't sound like a great idea to me.
I suppose that you could link to a local copy of the DTD; that would keep &nbsp; happy, but would probably break more parsing, since html doctypes are more or less magic words for many programs dealing with them (beginning with browsers, but some validators also do so). I would prefer not having to deal with the less developer-friendly numeric entities in the html.
If we are serving HTML5 (not XHTML), why is XML well-formedness important? I thought that HTML5 means giving up on it. An HTML5 parser must implement the "HTML entities", so they shouldn't need a DTD. http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#char... http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-...
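(An HTML5 parser indeed carries the whole named-reference table itself, no DTD involved. A tiny Python illustration with the stdlib html.parser, just as a sketch, not anything from MediaWiki:)

    from html.parser import HTMLParser

    class TextDump(HTMLParser):
        # With convert_charrefs=True (the default), named and numeric
        # references are expanded from the parser's built-in table before
        # handle_data() is called -- no DTD is ever consulted.
        def handle_data(self, data):
            print(repr(data))

    p = TextDump()
    p.feed('<!DOCTYPE html><p>a&nbsp;b &bull; &#8226;</p>')
    p.close()
    # prints 'a\xa0b • •'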
On Mon, Apr 26, 2010 at 5:04 PM, Platonides Platonides@gmail.com wrote:
> I suppose that you could link to a local copy of the DTD; that would keep &nbsp; happy, but would probably break more parsing, since html doctypes are more or less magic words for many programs dealing with them (beginning with browsers, but some validators also do so). I would prefer not having to deal with the less developer-friendly numeric entities in the html.
You'd think this would be annoying, but in fact, in article text we've always converted entities to UTF-8, and I've never actually been inconvenienced by it. Or even noticed it, without actually testing. I actually don't think a literal • is less developer-friendly than &bull;, say. The only common one where the UTF-8 form would be annoying is &nbsp;, and that's just one code point to remember, &#160;. Not to mention that 90% of &nbsp; uses could be replaced by a normal space with no actual change.
> If we are serving HTML5 (not XHTML), why is XML well-formedness important? I thought that HTML5 means giving up on it.
""" But HTML5 is tag soup! HTML5 doesn't require XML well-formedness – e.g., you can omit attribute quote marks – but it does permit it. MediaWiki currently still outputs well-formed XML by default. This means that by default, you can still (modulo bugs) parse MediaWiki pages using XML libraries, transform them via XSLT, etc. MediaWiki administrators who want to reduce the size of output HTML can disable $wgWellFormedXml. When HTML5 has been around for a while and HTML5 parsing libraries are as prevalent as XML parsing libraries, this benefit might not be so compelling anymore. """ http://www.mediawiki.org/wiki/HTML5#FAQ_about_MediaWiki_use_of_HTML5
In practice, tons of bots still do screen-scraping using XML libraries, and we get a lot of complaints very quickly if we start serving many non-well-formed pages. They should use the API instead, of course -- which is why I'm not *too* worried about the occasional entity creeping through and malforming a page, if we do use just <!DOCTYPE html>. Screen-scrapers should die anyway. :)