On Mon, Apr 26, 2010 at 5:04 PM, Platonides <Platonides@gmail.com> wrote:
> I suppose that you could link to a local copy of the DTD, that would keep happy but would probably break more parsing, since html doctypes are more or less magic words for many programs dealing with it (beginning with browsers, but some validators also do so). I would prefer not having to deal with the less developer friendly numeric entities in the html.
You'd think this would be annoying, but in fact, in article text we've always converted entities to UTF-8, and I've never actually been inconvenienced by it. Or even noticed it, without actually testing. I actually don't think • is less developer-friendly than &bull;, say. The only common one where the UTF-8 form would be annoying is &nbsp;, and that's just one code point to remember, &#160;. Not to mention that 90% of &nbsp; could be replaced by a normal space with no actual change.
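To make the conversion concrete, here is a minimal Python sketch of the idea, not MediaWiki's actual code. The function name `named_to_utf8` is made up for illustration, and it assumes only HTML 4 named entities matter. It replaces named entities with literal UTF-8 characters, leaves the XML-predefined ones alone, and writes the non-breaking space as a numeric reference so it stays visible in the source:

```python
import re
from html.entities import name2codepoint

# Entities that XML itself defines; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_utf8(markup: str) -> str:
    """Replace HTML named entities with literal characters.

    Hypothetical helper for illustration only.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave untouched
        codepoint = name2codepoint[name]
        if codepoint == 0xA0:
            # A literal non-breaking space is invisible in source,
            # so keep it as a numeric character reference.
            return "&#160;"
        return chr(codepoint)

    return _NAMED_ENTITY.sub(replace, markup)


print(named_to_utf8("a&nbsp;&bull;&nbsp;b &amp; c"))
```

The output needs no DTD to parse, since numeric references and the five predefined entities are valid in any XML document.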
> If we are serving HTML5 (not XHTML), why is XML well-formedness important? I thought that HTML5 means giving up on it.
"""
But HTML5 is tag soup!

HTML5 doesn't require XML well-formedness – e.g., you can omit attribute quote marks – but it does permit it. MediaWiki currently still outputs well-formed XML by default. This means that by default, you can still (modulo bugs) parse MediaWiki pages using XML libraries, transform them via XSLT, etc. MediaWiki administrators who want to reduce the size of output HTML can disable $wgWellFormedXml. When HTML5 has been around for a while and HTML5 parsing libraries are as prevalent as XML parsing libraries, this benefit might not be so compelling anymore.
"""
http://www.mediawiki.org/wiki/HTML5#FAQ_about_MediaWiki_use_of_HTML5
In practice, tons of bots still do screen-scraping using XML libraries, and we get a lot of complaints very quickly if we start serving many non-well-formed pages. They should use the API instead, of course -- which is why I'm not *too* worried about the occasional entity creeping through and malforming a page, if we do use just <!DOCTYPE html>. Screen-scrapers should die anyway. :)
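For anyone who wants to see the failure mode, here is a rough Python sketch of what an XML-based screen-scraper runs into. The sample pages and the `is_well_formed` helper are invented for illustration. With a bare <!DOCTYPE html> no DTD declares the named entities, so a stray &nbsp; makes the page non-well-formed, while the numeric form parses fine:

```python
import xml.etree.ElementTree as ET


def is_well_formed(page: str) -> bool:
    """Return True if the page parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(page)
    except ET.ParseError:
        return False
    return True


# Invented sample pages: same content, different entity style.
named = "<!DOCTYPE html><html><body><p>a&nbsp;b</p></body></html>"
numeric = "<!DOCTYPE html><html><body><p>a&#160;b</p></body></html>"

print(is_well_formed(named))    # named entity is undefined without a DTD
print(is_well_formed(numeric))  # numeric reference needs no DTD
```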