On Mon, Apr 26, 2010 at 5:04 PM, Platonides <Platonides(a)gmail.com> wrote:
> I suppose that you could link to a local copy of the DTD, that would
> keep happy but would probably break more parsing, since html doctypes
> are more or less magic words for many programs dealing with it
> (beginning with browsers, but some validators also do so).
>
> I would prefer not having to deal with the less developer friendly
> numeric entities in the html.
You'd think this would be annoying, but in fact, in article text we've
always converted entities to UTF-8, and I've never actually been
inconvenienced by it. Or even noticed it, without actually testing.
I actually don't think • is less developer-friendly than &bull;, say.
The only common one where the UTF-8 form would be annoying is &nbsp;,
and that's just one code point to remember, &#160;. Not to mention
that 90% of &nbsp; could be replaced by a normal space with no actual
change.
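
To make the entity-to-UTF-8 point concrete, here is a minimal sketch in Python. It is not MediaWiki's own code (which is PHP); the function name entities_to_utf8 is made up for illustration, and html.unescape from the standard library stands in for whatever conversion the parser really does.

```python
import html


def entities_to_utf8(text: str) -> str:
    """Replace named and numeric character references with the characters themselves."""
    return html.unescape(text)


# Named, decimal and hexadecimal references to the same bullet character.
samples = ["&bull;", "&#8226;", "&#x2022;"]
decoded = [entities_to_utf8(s) for s in samples]

# The non-breaking space is the awkward case: once decoded, it looks like an
# ordinary space in most editors, even though it is a different code point.
nbsp_text = entities_to_utf8("a&nbsp;b")
nbsp_bytes = nbsp_text.encode("utf-8")
```

All three bullet spellings end up as the same single character, which is why the literal form is no harder to work with than the named one.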
> If we are serving HTML5 (not XHTML) why is XML well-formedness
> important? I thought that HTML5 means giving up on it.
"""
But HTML5 is tag soup!
HTML5 doesn't require XML well-formedness – e.g., you can omit
attribute quote marks – but it does permit it. MediaWiki currently
still outputs well-formed XML by default. This means that by default,
you can still (modulo bugs) parse MediaWiki pages using XML libraries,
transform them via XSLT, etc. MediaWiki administrators who want to
reduce the size of output HTML can disable $wgWellFormedXml. When
HTML5 has been around for a while and HTML5 parsing libraries are as
prevalent as XML parsing libraries, this benefit might not be so
compelling anymore.
"""
<http://www.mediawiki.org/wiki/HTML5#FAQ_about_MediaWiki_use_of_HTML5>
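
As a rough illustration of what breaks for XML-based consumers, the sketch below feeds three small pages to Python's standard xml.etree.ElementTree parser. The page template and the helper name is_well_formed are made up for this example; it only assumes that the consumer uses a non-validating XML parser that does not load an external DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a plain XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical minimal page using the bare HTML5 doctype.
PAGE = "<!DOCTYPE html><html><body><p{attr}>a{ref}b</p></body></html>"

# Numeric references need no DTD, so the page stays well-formed.
numeric_ok = is_well_formed(PAGE.format(attr="", ref="&#160;"))

# A named entity is undefined without a DTD that declares it.
named_ok = is_well_formed(PAGE.format(attr="", ref="&nbsp;"))

# Unquoted attribute values are valid HTML5 but not well-formed XML.
unquoted_ok = is_well_formed(PAGE.format(attr=" class=x", ref=" "))
```

Only the first page parses; a stray named entity or an unquoted attribute is enough to make an XML library reject the whole document.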
In practice, tons of bots still do screen-scraping using XML
libraries, and we get a lot of complaints very quickly if we start
serving many non-well-formed pages. They should use the API instead,
of course -- which is why I'm not *too* worried about the occasional
entity creeping through and malforming a page, if we do use just
<!DOCTYPE html>. Screen-scrapers should die anyway. :)
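
For bot authors wondering what "use the API" looks like, here is a small sketch that only builds a request URL and makes no network call. The endpoint and the page title are assumptions chosen for illustration, and wikitext_query_url is a made-up helper; the query parameters are the standard action=query / prop=revisions ones.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint for illustration; each wiki exposes its own api.php.
API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def wikitext_query_url(title: str, endpoint: str = API_ENDPOINT) -> str:
    """Build a URL asking the API for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


url = wikitext_query_url("HTML5")
query = parse_qs(urlsplit(url).query)
```

The response is structured data rather than a rendered page, so it does not depend on the doctype or on the page being well-formed XML.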