In XML, named entity references like &nbsp; and &bull; (with the special exceptions of &lt; &gt; &amp; &quot; &apos;) can be treated as well-formedness errors across the board by conformant XML processors. (Yes, this means that *any* XML document that uses *any* named entity reference except the special five is not well-formed, if you ask these XML processors.) Alternatively, if a DTD is provided, conformant XML processors can retrieve the DTD, parse it, and treat the reference as a well-formedness error if it doesn't occur in the DTD, otherwise parse it as you'd expect. (Yes, processors can really pick whichever behavior they want, as far as I understand it. As we all know, the great thing about standards is how many there are to choose from.)
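(To make that concrete, here's roughly what a typical non-validating parser does with an undefined named reference -- just an illustrative Python sketch using the stdlib, not anything MediaWiki actually runs:)

    import xml.etree.ElementTree as ET

    # expat (non-validating) never fetches a DTD, so any named reference
    # beyond the predefined five is reported as a well-formedness error.
    try:
        ET.fromstring('<p>foo&nbsp;bar</p>')
    except ET.ParseError as e:
        print('not well-formed:', e)        # "undefined entity &nbsp;..."

    # Numeric references and the five predefined names are always fine.
    ET.fromstring('<p>foo&#160;bar &amp; &lt;baz&gt;</p>')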
In practice, as far as I can tell, the XML UAs that our users use do the latter, retrieving the DTD. (Otherwise they'd instantly break, and our users wouldn't use them!) Thus we get away with using &nbsp; and such, and still work in these UAs. But this means we have to provide a doctype with a DTD, which means not just <!DOCTYPE html>. This is the default behavior on trunk -- we output an XHTML Strict DTD even though the document is actually HTML5. This has a few disadvantages, in addition to just being odd:
1) Validators treat the content as XHTML Strict, not HTML5, so it fails validation unless you specifically ask for HTML5 validation. I've already seen a couple of complaints about this, and we haven't even released yet. Lots of people care about validation.
2) XML processors are still within their rights to reject the page, declining to process the DTD and treating the page as non-well-formed.
3) For XML processors that do process the DTD, we force them to do a network load as soon as they start parsing the page. Presumably this slows down parsing (dunno how much in practice), and it also hurts the W3C's poor servers: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
The alternative is to simply not use any named character references -- replace them all by numeric ones. E.g., use &#160; instead of &nbsp;, and &#8226; instead of &bull;. Then we can use <!DOCTYPE html> by default and avoid these problems. In fact, we already do this for anything that passes through the parser, as far as I can tell -- we convert it to UTF-8.
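(Roughly the kind of transformation I mean -- a hedged Python sketch using the stdlib's HTML5 entity table; the real change would of course live in our PHP output code, and numericize() is just a made-up name for illustration:)

    import re
    from html.entities import html5   # maps e.g. 'nbsp;' -> '\xa0'

    XML_SAFE = {'amp;', 'lt;', 'gt;', 'quot;', 'apos;'}

    def numericize(markup):
        """Replace named character references (other than the XML five)
        with numeric ones, so no DTD is needed for well-formedness."""
        def repl(m):
            name = m.group(1) + ';'
            if name in XML_SAFE or name not in html5:
                return m.group(0)
            return ''.join('&#%d;' % ord(c) for c in html5[name])
        return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', repl, markup)

    print(numericize('a&nbsp;b &bull; &amp; &lt;'))
    # a&#160;b &#8226; &amp; &lt;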
The problem is that if we do this and then miss a few entities somewhere in the source code, some pages will mysteriously become non-well-formed and tools will break. Plus, of course, you have the usual risks of breakage from mass changes. Overall, though, I'd prefer that we do this, because the alternative is that I'd have to pester the standards people and validator people for a means to let us validate properly with an XHTML Strict doctype.
Are there any objections to me removing all named entity references from MediaWiki output?
* Aryeh Gregor Simetrical+wikilist@gmail.com [Sun, 25 Apr 2010 20:46:14 -0400]:
> In XML, named entity references like &nbsp; and &bull; (with the special exceptions of &lt; &gt; &amp; &quot; &apos;) can be treated as well-formedness errors across the board by conformant XML processors. (Yes, this means that *any* XML document that uses *any* named entity reference except the special five is not well-formed, if you ask these XML processors.) Alternatively, if a DTD is provided, conformant XML processors can retrieve the DTD, parse it, and treat the reference as a well-formedness error if it doesn't occur in the DTD, otherwise parse it as you'd expect. (Yes, processors can really pick whichever behavior they want, as far as I understand it. As we all know, the great thing about standards is how many there are to choose from.)
Wouldn't it be enough just to define an entity? http://www.criticism.com/dita/dtd2.html#section-ENTITIES I used such a definition for nbsp once in an XSL sheet. Don't know how well it works alone in XML. Dmitriy
On Mon, Apr 26, 2010 at 4:02 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
> Wouldn't it be enough just to define an entity? http://www.criticism.com/dita/dtd2.html#section-ENTITIES I used such a definition for nbsp once in an XSL sheet. Don't know how well it works alone in XML.
I guess that would be possible, yes, but HTML defines an awful lot of entities, and adding them all inline to every page doesn't sound like a great idea to me.
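(For what it's worth, Dmitriy's approach does work at the well-formedness level: an entity declared in the internal DTD subset needs no external fetch. A quick Python sketch to illustrate, not a recommendation:)

    import xml.etree.ElementTree as ET

    # Declaring the entities in the internal subset makes the references
    # legal without any external DTD -- but covering every named reference
    # HTML defines would mean shipping about two thousand such lines on
    # every single page.
    doc = '''<!DOCTYPE html [
      <!ENTITY nbsp "&#160;">
      <!ENTITY bull "&#8226;">
    ]>
    <html><body><p>a&nbsp;b &bull; c</p></body></html>'''
    print(repr(ET.fromstring(doc).find('.//p').text))   # 'a\xa0b • c'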
Aryeh Gregor wrote:
> On Mon, Apr 26, 2010 at 4:02 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
>> Wouldn't it be enough just to define an entity? http://www.criticism.com/dita/dtd2.html#section-ENTITIES I used such a definition for nbsp once in an XSL sheet. Don't know how well it works alone in XML.
> I guess that would be possible, yes, but HTML defines an awful lot of entities, and adding them all inline to every page doesn't sound like a great idea to me.
I suppose that you could link to a local copy of the DTD; that would keep &nbsp; happy, but would probably break more parsing, since html doctypes are more or less magic words for many programs dealing with them (beginning with browsers, but some validators also do so). I would prefer not having to deal with the less developer-friendly numeric entities in the html.
If we are serving HTML5 (not XHTML), why is XML well-formedness important? I thought that HTML5 means giving up on it. An HTML5 parser must implement the "HTML entities", so they shouldn't need a DTD. http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#char... http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-...
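(An HTML5 parser indeed carries the whole named-reference table itself, no DTD involved. A tiny Python illustration with the stdlib html.parser, just as a sketch, not anything from MediaWiki:)

    from html.parser import HTMLParser

    class TextDump(HTMLParser):
        # With convert_charrefs=True (the default), named and numeric
        # references are expanded from the parser's built-in table before
        # handle_data() is called -- no DTD is ever consulted.
        def handle_data(self, data):
            print(repr(data))

    p = TextDump()
    p.feed('<!DOCTYPE html><p>a&nbsp;b &bull; &#8226;</p>')
    p.close()
    # prints 'a\xa0b • •'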
On Mon, Apr 26, 2010 at 5:04 PM, Platonides Platonides@gmail.com wrote:
> I suppose that you could link to a local copy of the DTD; that would keep &nbsp; happy, but would probably break more parsing, since html doctypes are more or less magic words for many programs dealing with them (beginning with browsers, but some validators also do so). I would prefer not having to deal with the less developer-friendly numeric entities in the html.
You'd think this would be annoying, but in fact, in article text we've always converted entities to UTF-8, and I've never actually been inconvenienced by it. Or even noticed it, without actually testing. I actually don't think a literal • is less developer-friendly than &bull;, say. The only common one where the UTF-8 form would be annoying is &nbsp;, and that's just one code point to remember, &#160;. Not to mention that 90% of &nbsp; uses could be replaced by a normal space with no actual change.
> If we are serving HTML5 (not XHTML), why is XML well-formedness important? I thought that HTML5 means giving up on it.
""" But HTML5 is tag soup! HTML5 doesn't require XML well-formedness – e.g., you can omit attribute quote marks – but it does permit it. MediaWiki currently still outputs well-formed XML by default. This means that by default, you can still (modulo bugs) parse MediaWiki pages using XML libraries, transform them via XSLT, etc. MediaWiki administrators who want to reduce the size of output HTML can disable $wgWellFormedXml. When HTML5 has been around for a while and HTML5 parsing libraries are as prevalent as XML parsing libraries, this benefit might not be so compelling anymore. """ http://www.mediawiki.org/wiki/HTML5#FAQ_about_MediaWiki_use_of_HTML5
In practice, tons of bots still do screen-scraping using XML libraries, and we get a lot of complaints very quickly if we start serving many non-well-formed pages. They should use the API instead, of course -- which is why I'm not *too* worried about the occasional entity creeping through and malforming a page, if we do use just <!DOCTYPE html>. Screen-scrapers should die anyway. :)