Proposal: switch to HTML 5 - Wikitech-l

7 Jul 2009


HTML 5 is the up-and-coming version of the HTML standard, which
supports all sorts of new and exciting features.  For those who don't
know about it, here's some background:

* Wikipedia article: http://en.wikipedia.org/wiki/HTML_5
* Summary of major differences from HTML 4: http://www.w3.org/TR/html5-diff/
* Full specification: http://dev.w3.org/html5/spec/Overview.html

It's clear at this point that HTML 5 will be the next version of HTML.
It was obvious for a long time that XHTML was going nowhere, but now
it's official: the W3C has announced that it will not renew the XHTML 2
Working Group's charter when it expires at the end of 2009, and will
put its resources into HTML 5 instead.  (Source:
http://www.w3.org/2009/06/xhtml-faq.html)  MediaWiki will have to
switch to HTML 5 sooner or later.  It's a great standard, and I think
we would do well to be ahead of the curve here and help spark interest
in and support for it.

HTML 5 is designed to be backward-compatible with legacy content, both
on the authoring side and (especially) the implementation side.
Well-written XHTML 1.0 should theoretically need only minor
modifications to validate as HTML 5, and indeed this appears to be the
case in practice.  All that's required to get a typical page in
Monobook validating as HTML 5 in the W3C's experimental validator is
(*if* we disregard user-added markup):

* Change the doctype to "<!doctype html>".
* Delete '<meta http-equiv="Content-Style-Type" content="text/css"
/>'.  Which is a really stupid element anyway.  :P
* Delete name attributes from all <a> elements.  They've been
redundant to id for eternity, and every browser in the universe
supports id, so we can finally put the anchors on the headings
themselves as id attributes.
* Remove comments from inside <script> tags with a src attribute.  I
already did this in r52828, since they're pointless anyway.

(The W3C validator is at http://validator.w3.org/.  You can override
the doctype and set it to interpret Wikipedia URLs as HTML 5 under
"More Options".)
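
As a rough illustration of how mechanical these four changes are, here
is a minimal Python sketch that applies them to a made-up XHTML 1.0
fragment using regular expressions.  The function name to_html5 and the
sample markup are hypothetical, not MediaWiki code; the real fix would
be made in the PHP that generates the page, not by post-processing.

```python
import re


def to_html5(page):
    """Apply the four changes listed above to an XHTML 1.0 string."""
    # 1. Replace whatever doctype is present with the HTML 5 one.
    page = re.sub(r"<!DOCTYPE[^>]*>", "<!doctype html>", page,
                  count=1, flags=re.IGNORECASE)
    # 2. Drop the Content-Style-Type <meta> element.
    page = re.sub(r'<meta\s+http-equiv="Content-Style-Type"[^>]*>\s*', "",
                  page, flags=re.IGNORECASE)
    # 3. Strip name="..." from <a> start tags (id does the same job).
    page = re.sub(r'(<a\b[^>]*?)\s+name="[^"]*"', r"\1", page)
    # 4. Empty the body of any <script> that has a src attribute.
    page = re.sub(r"(<script\b[^>]*\bsrc=[^>]*>).*?(</script>)", r"\1\2",
                  page, flags=re.DOTALL)
    return page


# Made-up sample in the style of an XHTML 1.0 skin (an assumption).
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>\n'
    '<meta http-equiv="Content-Style-Type" content="text/css" />\n'
    '<script type="text/javascript" src="/skins/example.js">'
    '<!-- example --></script>\n'
    '</head><body><a name="History" id="History"></a>'
    '<h2>History</h2></body></html>\n'
)

if __name__ == "__main__":
    print(to_html5(SAMPLE))
```

Regexes are good enough for a demonstration on known markup, but they
are not a general way to rewrite HTML.
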
Note that HTML 5 does follow in the "strict" vein of XHTML.
Presentational elements and attributes such as font, border, and
cellpadding are all invalid in HTML 5.  (Implementations must
support them, but conforming documents must not use them.  <b> and <i>
remain valid.)  There's very little of this stuff left in the HTML
that ships with the software.  We can remove it incrementally as
it's reported.

For user-added content, I think it's fair to just treat it as GIGO
(garbage in, garbage out): if users submit invalid content that can't
be easily converted to a valid form, it will be output as-is.  Users
can already submit invalid content in cases where we can't easily fix
it, e.g., duplicate ids.
If we switch to HTML 5, the W3C validator will begin outputting errors
on this presentational stuff, which should hopefully encourage users
to reduce it over time, at least in high-profile places like the front
page or infoboxes.

So converting to HTML 5 would be trivial.  Beyond lending our support
to a good standard, the switch would also bring several modest
practical benefits.  I include here only things that are possible in
valid HTML 5 documents but would not validate as XHTML 1 (so excluding
stuff like localStorage, which is a script API and works under either
doctype), and that are usable right now (so excluding stuff like <nav>,
<input type=color>, etc.):
* HTML 5 permits omission of a lot of the cruft that XHTML requires.
It permits leaving off end tags in most cases where that's
unambiguous, and leaving off some tags entirely that XHTML requires
(such as <html>, <head>, and <body> if they have no attributes).  The
"/>" ending is no longer required.  Superfluous attributes like
type=text/javascript on <script> are no longer needed (unless you want
to use <script type=application/x-python> or something, of course!).
Quotes may be omitted around any attribute value that is non-empty and
contains no whitespace, quotes, "=", "<", ">", or "`".  The
doctype is shorter and easy to remember, and there is no xmlns
attribute.  For an example of how compact valid HTML 5 can be, look at
the source of http://aryeh.name/.  I once did a crude test and found
we could cut 5% or so off the length of our HTML by doing this --
*after* gzipping.  Not only does this make our code smaller, it will
also make it easier to read.
* We could support <video>/<audio> on conformant user agents without
the use of JavaScript.  There's no reason we should need JS for
Firefox 3.5, Chrome 3, etc.
* We can use data-* attributes to store custom data for scripts.  This
came up in the case of the HTML diff work: the author of that stuck
some data for scripts in custom attributes, which caused XHTML 1
validation to fail.
* We can use HTML 5 form attributes.  These will enhance the
experience of users of appropriate browsers, and do nothing for
others.  At least Opera 9.6x already supports almost all HTML 5 form
attributes.  (Source:
http://www.opera.com/docs/specs/presto211/forms/)  We could, for
instance, give required fields the "required" attribute, which will
cause the browser to block the form submission and notify the user
if they aren't filled in, without needing JavaScript or a round trip
to the server.  (The server-side check has to stay, since other
browsers will ignore the attribute.)  The "pattern" attribute even
allows requiring that the input match a regex, and this is also
supported by Opera 9.6x.
See http://dev.w3.org/html5/spec/Overview.html#common-input-element-attributes.
* There are a couple of parser tests that currently fail because of
misnested tags.  If we altered the parser to no longer output any </p>
tags (which HTML 5 permits), these tests would immediately pass.  It
doesn't look like anyone's going to fix them otherwise.
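
A crude size comparison of the kind mentioned in the first bullet could
look like the following Python sketch.  The two fragments are made-up
equivalents, not real MediaWiki output, and the helper name sizes is
hypothetical.  Whatever it prints for these toy strings says nothing
about real pages; for a meaningful figure, feed it actual page source
before and after stripping the optional syntax.

```python
import gzip

# Hypothetical fragment in XHTML 1.0 style: everything quoted and closed.
VERBOSE = (
    '<ul>\n'
    '<li><a href="/wiki/Main_Page" title="Main_Page">Main page</a></li>\n'
    '<li><a href="/wiki/Help" title="Help">Help</a></li>\n'
    '</ul>\n'
    '<p>First paragraph.</p>\n'
    '<p>Second paragraph.<br /></p>\n'
)

# The same content using the omissions HTML 5 permits: no quotes around
# simple attribute values, no </li> or </p>, and no "/>" on <br>.
COMPACT = (
    '<ul>\n'
    '<li><a href=/wiki/Main_Page title=Main_Page>Main page</a>\n'
    '<li><a href=/wiki/Help title=Help>Help</a>\n'
    '</ul>\n'
    '<p>First paragraph.\n'
    '<p>Second paragraph.<br>\n'
)


def sizes(markup):
    """Return (raw bytes, gzipped bytes) for a markup string."""
    raw = markup.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


if __name__ == "__main__":
    for label, markup in (("verbose", VERBOSE), ("compact", COMPACT)):
        raw, zipped = sizes(markup)
        print(f"{label}: {raw} bytes raw, {zipped} bytes gzipped")
```

Comparing the gzipped sizes matters more than the raw ones, since
repetitive closing tags compress well and the saving on the wire is
smaller than the raw difference suggests.
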
These are only a few of the things that have immediate concrete
benefit.  There are probably more I couldn't find immediately (HTML 5
is a huge spec), and of course in the long term there's an incredible
amount that would be invaluable to us.

I propose the following migration plan:

1) Fix the doctype, Content-Style-Type, and name attributes.  We can
then officially claim we're shipping HTML 5!  :)  (Albeit maybe
invalid in some cases.)  Also remove any unnecessary attributes and
elements, without breaking XML well-formedness.  Begin using HTML 5
form attributes and any other useful features.  Poke the Cortado
people about letting <video> work without JavaScript.
2) Once this goes live, if no problems arise, try causing an XML
well-formedness error.  For instance, remove the quote marks around
one attribute of an element that's included in every page.  I suggest
this as a separate step because I suspect there are some bot operators
who are doing screen-scraping using XML libraries, so it would be a
good idea to assess how feasible it is at the present time to stop
being well-formed.  In the long run, of course, those bot operators
should switch to using the API.  If we receive enough complaints once
this goes live, we can revert it and continue to ship HTML 5 that's
also well-formed XML, for the time being.
3) If XML well-formedness is not a problem, get rid of all unneeded
closing tags, quotation marks, self-closing "/>" constructs, etc.
Create an Html class like Xml, which will generate elements in the
nice compact form that HTML 5 permits, and phase out use of Xml in
favor of Html.  (Xml has long since ceased to be purely about XML
anyway.)
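
To make steps 2 and 3 a bit more concrete, here is a minimal Python
sketch (MediaWiki itself is PHP, so this only illustrates the idea).
The element() helper is a hypothetical stand-in for the proposed Html
class: it leaves off quotes where HTML 5 allows it and never emits
"/>".  is_well_formed_xml() shows what a screen-scraper built on an XML
library would run into, using Python's xml.etree.ElementTree as the
example parser.  The field name wpName is just a sample value.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

# Elements that never take an end tag in HTML (a partial list).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

# Values matching this need quotes: empty, or containing whitespace,
# quotes, "=", "<", ">", or "`".
NEEDS_QUOTES = re.compile(r"""^$|[\s"'=<>`]""")


def element(name, attrs=None, contents=""):
    """Build one element in compact HTML 5 syntax (hypothetical helper).

    `contents` is inserted as-is, so it must already be safe HTML.
    """
    out = "<" + name
    for key, value in (attrs or {}).items():
        if value is True:
            # Boolean attribute such as "required": name only.
            out += " " + key
        elif NEEDS_QUOTES.search(value):
            out += f' {key}="{escape(value, quote=True)}"'
        else:
            out += f" {key}={escape(value, quote=False)}"
    out += ">"
    if name in VOID_ELEMENTS:
        return out  # no "/>" and no end tag
    return out + contents + "</" + name + ">"


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    link = element("a", {"href": "/wiki/Main_Page", "title": "Main page"},
                   "Main page")
    print(element("input", {"name": "wpName", "required": True}))
    print(link)
    print(is_well_formed_xml(link))
    print(is_well_formed_xml('<a href="/wiki/Main_Page">Main page</a>'))
```

The compact link is fine as HTML 5, but the XML parser rejects it
because of the unquoted href.  That is exactly the breakage step 2 is
meant to probe before step 3 commits to it everywhere.
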
So, what are people's thoughts?