On Mon, Jul 2, 2012 at 7:36 PM, Rob Lanphier <robla@wikimedia.org> wrote:
> That plan may be more conservative than we need to be, given it's been
> enabled on mediawiki.org for so long. At the time Aryeh wrote that,
> the feature hadn't been as well tested as it is now. That's not to
> say that we won't find bugs, but that I don't think there will be as
> many, that they aren't likely to be severe, and it seems we're in a
> better position to address them quickly than we were when that was
> written. I wouldn't mind going that route if a lot of other people
> feel we should, but it seems likely to me that we might accidentally
> introduce production glitches in the process of implementing the
> interim steps, and that there could very well be bugs in the interim
> states that don't occur in the final stage.
Just to clarify the history here, I originally suggested just turning
it on. I expected (and expect) that there will be a bit of fallout,
but not a lot -- it should be quickly fixable. The stuff that carries
bigger compatibility risks is behind separate switches such as
$wgWellFormedXml and $wgExperimentalHtmlIds.
> Are you sure that $wgHtml5 is distinct from the doctype? It looks
> like mediawiki.org also has the doctype set, and it looks as though
> Html.php sets it based on that variable.
IIRC, I added a separate variable that allows changing the doctype
separately from $wgHtml5 in case anyone wanted to experiment with
changing the doctype and rest of the page separately. This is because
changing the doctype will affect rendering in certain cases, moving
from "almost-standards" to "standards" rendering, while changing the
rest of the markup might have unrelated effects. But the doctype
should change along with $wgHtml5 if you don't override it.
>> It's also unclear whether every issue reported in the comments of
>> bug 27478 was filed as a separate bug. In particular, I'm unsure if
>> Cite was ever properly fixed (or if the alternate stop-gap solution
>> Aryeh mentioned was implemented). As I recall, the Cite breakage
>> was breaking links in articles.
> This is what I'm hoping we can get some clarity on. How many of those
> comments are still relevant?
Comments 0-5 are still relevant. r82413 will likely need to be
reinstated and enforced in review if you don't want to break XML
processors. Named entities like &nbsp; will no longer work in XML
parsers with no DTD in the doctype -- except for the five predefined
ones: &amp; &lt; &gt; &quot; &apos;. This is likely to be a big
issue, because it will be a headache to make sure extensions don't
output such entities in raw HTML. (The parser/sanitizer will already
take care of them in user input or parsed HTML, though.) If auditing
isn't put in place, I'd expect XML parsers to break as soon as the
change is deployed, and to keep breaking thereafter as people
accidentally introduce new entities.
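To make the failure mode concrete, here is a minimal sketch using Python's standard-library XML parser (any non-validating XML parser behaves the same way): the predefined entities parse fine, but an HTML named entity such as &nbsp; is a fatal error when no DTD is declared.

```python
import xml.etree.ElementTree as ET

# The five predefined XML entities need no DTD, so this parses.
ok = ET.fromstring('<p>Fish &amp; chips &lt;tasty&gt;</p>')
print(ok.text)  # Fish & chips <tasty>

# &nbsp; is an HTML named entity, not an XML one.  With no DTD in the
# doctype, a conforming XML parser must treat it as a fatal error.
try:
    ET.fromstring('<p>non&nbsp;breaking</p>')
except ET.ParseError as err:
    print('parse error:', err)  # undefined entity
```

This is exactly what a screen-scraping bot built on an XML library would hit the first time an extension emits a stray entity.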
The way around this would be either to use a non-HTML5 doctype (see
end of post), or just give up on XML scrapers and tell them that their
bots will break until they switch to an HTML5 parser or the API. In
the latter case, $wgWellFormedXml can be set to false also, if people
like.
Comment 12 is no longer relevant, because $wgExperimentalHtmlIds was
turned off by default.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-June/053775.html
is still a good summary of possible issues, particularly the emphasis
on issue 2.
I don't know if comment 27 is still relevant -- probably, but it
should be trivial to fix. There are likely to be some pages using
table-based layout and images that will start displaying badly and
that users will have to add a few extra style rules to fix.
The major issue that I see is still the named-entities problem, which
is what led to $wgHtml5 being rapidly disabled both previous times it was turned
on. To avoid breaking XML tools, the doctype could be set to XHTML
1.0 Strict or such with $wgHtml5 on, so HTML5 features would still
work. This would make the page valid HTML5, since HTML5 allows some
legacy doctypes that do specify a DTD:
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#obs…
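For instance, one of the "obsolete permitted" doctypes that HTML5 allows, and one that does name a DTD that XML tools can use, is the XHTML 1.0 Strict declaration:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```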
The issue is that it would confuse validator.w3.org into trying to
validate as XHTML 1.0 etc., which would make people complain that the
pages are invalid. You would have to set it specifically to validate
as HTML5 for it to pass. (HTML5 validators are generally much
pickier, though, so expect a lot of pages not to validate as HTML5
either.)
The alternative, as I said, would be to just let XML screen-scraper
bots break. Most languages provide some type of HTML parser that the
bots could be switched to, I believe. Python has a particularly good
HTML5 parser, I think, which will parse the page the same way
browsers do.
In this case, switching off $wgWellFormedXml won't hurt anything and
will decrease page size slightly.
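As a sketch of that alternative, even Python's standard-library HTMLParser (the third-party HTML5 parser alluded to above is the more rigorous option) happily tokenizes markup that is fatal to an XML parser, reporting &nbsp; as just another entity reference rather than aborting:

```python
from html.parser import HTMLParser

class EntityCollector(HTMLParser):
    """Collect text and named entity references instead of dying on them.
    (EntityCollector is a hypothetical name for this sketch.)"""
    def __init__(self):
        super().__init__(convert_charrefs=False)  # report entities separately
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        # The "undefined entity" that kills an XML parser is just
        # another token here.
        self.pieces.append(f'[entity:{name}]')

p = EntityCollector()
p.feed('<p>non&nbsp;breaking &amp; more</p>')
print(''.join(p.pieces))  # non[entity:nbsp]breaking [entity:amp] more
```

A bot ported this way keeps working regardless of which named entities the wiki's HTML happens to contain.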