On Mon, Jul 2, 2012 at 7:36 PM, Rob Lanphier robla@wikimedia.org wrote:
That plan may be more conservative than we need to be, given it's been enabled on mediawiki.org for so long. At the time Aryeh wrote that, the feature hadn't been as well tested as it is now. That's not to say we won't find bugs, but I don't think there will be as many, they aren't likely to be severe, and we seem to be in a better position to address them quickly than we were when that was written. I wouldn't mind going that route if a lot of other people feel we should, but it seems likely to me that we might accidentally introduce production glitches in the process of implementing the interim steps, and that there could very well be bugs in the interim states that don't occur in the final stage.
Just to clarify the history here, I originally suggested just turning it on. I expected (and expect) that there will be a bit of fallout, but not a lot -- it should be quickly fixable. The stuff that carries bigger compatibility risks is behind separate switches such as $wgWellFormedXml and $wgExperimentalHtmlIds.
Are you sure that $wgHtml5 is distinct from the doctype? It looks like mediawiki.org also has the doctype set, and it looks as though Html.php sets it based on that variable.
IIRC, I added a separate variable that allows changing the doctype separately from $wgHtml5 in case anyone wanted to experiment with changing the doctype and rest of the page separately. This is because changing the doctype will affect rendering in certain cases, moving from "almost-standards" to "standards" rendering, while changing the rest of the markup might have unrelated effects. But the doctype should change along with $wgHtml5 if you don't override it.
It's also unclear whether every issue reported in the comments of bug 27478 was filed as a separate bug. In particular, I'm unsure whether Cite was ever properly fixed (or whether the stop-gap solution Aryeh mentioned was implemented). As I recall, the Cite breakage was breaking links in articles.
This is what I'm hoping we can get some clarity on. How many of those comments are still relevant?
Comments 0-5 are still relevant. r82413 will likely need to be reinstated and enforced in review if you don't want to break XML processors. Named entities like &nbsp; will no longer work in XML parsers with no DTD in the doctype -- except for the five predefined ones: &amp; &lt; &gt; &quot; &apos;. This is likely to be a big issue, because it will be a headache to make sure extensions don't output such entities in raw HTML. (The parser/sanitizer will already take care of them in user input or parsed HTML, though.) If auditing isn't put into place, I'd expect that XML parsers would break as soon as the change is deployed, and regularly break thereafter as people accidentally introduce new entities.
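The failure mode is easy to reproduce with any DTD-less XML parser; here's a minimal sketch using Python's stdlib ElementTree (as a stand-in for whatever XML library a scraper bot might use):

```python
import xml.etree.ElementTree as ET

# The five predefined XML entities parse fine even without a DTD:
ok = ET.fromstring('<p>&amp; &lt; &gt; &quot; &apos;</p>')
print(repr(ok.text))  # '& < > " \''

# But an HTML named entity such as &nbsp; is undefined to a
# DTD-less XML parser, so parsing fails outright:
try:
    ET.fromstring('<p>&nbsp;</p>')
except ET.ParseError as e:
    print('ParseError:', e)  # "undefined entity"
```

So a single stray &nbsp; anywhere in an extension's raw HTML output is enough to make the entire page unparseable for such tools, which is why ongoing auditing (not just a one-time fix) would be needed.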
The way around this would be either to use a non-HTML5 doctype (see end of post), or just give up on XML scrapers and tell them that their bots will break until they switch to an HTML5 parser or the API. In the latter case, $wgWellFormedXml can be set to false also, if people like.
Comment 12 is no longer relevant, because $wgExperimentalHtmlIds was turned off by default.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-June/053775.html is still a good summary of possible issues, particularly the emphasis on issue 2.
I don't know if comment 27 is still relevant -- probably, but it should be trivial to fix. There are likely to be some pages using table-based layouts and images that will start displaying badly, and users will have to add a few extra style rules to fix them.
The major issue that I see is still the named-entities problem, which is what led to rapid disabling both previous times $wgHtml5 was turned on. To avoid breaking XML tools, the doctype could be set to XHTML 1.0 Strict or such with $wgHtml5 on, so HTML5 features would still work. This would make the page valid HTML5, since HTML5 allows some legacy doctypes that do specify a DTD:
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#obso...
The issue is that it would confuse validator.w3.org into trying to validate the pages as XHTML 1.0 etc., which would make people complain that the pages are invalid. You would have to tell it specifically to validate as HTML5 for them to pass. (HTML5 validators are generally much pickier, though, so expect a lot of pages not to validate as HTML5 either.)
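Concretely, the two doctypes in question would look like this (assuming XHTML 1.0 Strict as the legacy choice; per the WHATWG link above, this is one of the "obsolete permitted" doctypes, so a page using it can still be conforming HTML5):

```
<!-- Standard HTML5 doctype: no DTD, so XML parsers know only the
     five predefined entities -->
<!DOCTYPE html>

<!-- XHTML 1.0 Strict doctype: names a DTD that defines &nbsp; etc. -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```

Note that an XML parser only benefits from the second form if it is actually configured to fetch and process the external DTD, which many are not by default.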
The alternative, as I said, would be to just let XML screen-scraper bots break. I believe most languages provide some type of HTML parser they could switch to. Python in particular has a very good HTML5 parser, which will parse the page the same way browsers do. In this case, switching off $wgWellFormedXml won't hurt anything and will decrease page size slightly.
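To illustrate what "switching to an HTML parser" means for a bot author, here's a minimal sketch using Python's stdlib html.parser (the HTML5 parser mentioned above is presumably html5lib, which offers a similar event/tree API; the page markup and class name here are hypothetical):

```python
from html.parser import HTMLParser

# A toy scraper that collects link targets. Unlike an XML parser,
# it tolerates tag soup: unquoted attributes, unclosed <li> tags,
# and undefined named entities won't abort the parse.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

page = '<ul><li><a href=/wiki/Foo>Foo&nbsp;1<li><a href=/wiki/Bar>Bar</ul>'
c = LinkCollector()
c.feed(page)
print(c.links)  # ['/wiki/Foo', '/wiki/Bar']
```

The same markup would be a fatal error for an XML parser (non-well-formed, undefined entity), which is exactly why dropping $wgWellFormedXml costs these bots nothing once they've made the switch.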