HTML 5 is the up-and-coming version of the HTML standard, which
supports all sorts of new and exciting features. For those who don't
know about it, here's some background:
Wikipedia article: http://en.wikipedia.org/wiki/HTML_5
Summary of major differences from HTML 4: http://www.w3.org/TR/html5-diff/
Full specification: http://dev.w3.org/html5/spec/Overview.html
It's clear at this point that HTML 5 will be the next version of HTML.
It was obvious for a long time that XHTML was going nowhere, but now
it's official: the XHTML working group has been disbanded and work on
all non-HTML 5 variants of HTML has ceased. (Source:
<http://www.w3.org/2009/06/xhtml-faq.html>) MediaWiki will have to
switch to HTML 5 sooner or later. It's a great standard, and I think
we would do well to be early on the curve here and help spark interest
in and support for it.
HTML 5 is designed to be backward-compatible with legacy content, both
on the authoring side and (especially) the implementation side.
Well-written XHTML 1.0 should theoretically need only minor
modifications to validate as HTML 5, and indeed this appears to be the
case in practice. All that's required to get a typical page in
Monobook validating as HTML 5 in the W3C's experimental validator is
(*if* we disregard user-added markup):
* Change the doctype to "<!doctype html>".
* Delete '<meta http-equiv="Content-Style-Type" content="text/css"
/>'. Which is a really stupid element anyway. :P
* Delete name attributes from all <a> elements. They've been
redundant to id for eternity, and every browser in the universe
supports id; we can finally move these to the headers themselves.
* Remove comments from inside <script> tags with a src attribute. I
already did this in r52828, since they're pointless anyway.
(The W3C validator is at http://validator.w3.org/. You can override
the doctype and set it to interpret Wikipedia URLs as HTML 5 under
"More Options".)
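In other words, the head of a Monobook page would change roughly like this (a sketch; the surrounding markup is abbreviated and the heading text is made up):

```html
<!-- Before (XHTML 1.0 Transitional): -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<meta http-equiv="Content-Style-Type" content="text/css" />
<a name="History" id="History"></a><h2>History</h2>

<!-- After (HTML 5): -->
<!doctype html>
<h2 id="History">History</h2>
```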
Note that HTML 5 does follow in the "strict" vein of XHTML.
Presentational elements and attributes such as font, border,
cellpadding, etc. are all invalid in HTML 5. (Implementations must
support them, but conforming documents must not use them. <b> and <i>
remain valid.) There's very little of this stuff left in the HTML
that ships with the software. We can remove this incrementally as
it's reported.
For user-added content, I think it's fair to just treat it as GIGO --
if they submit invalid content that can't be easily converted to a
valid form, it will be output as-is. Users can already submit invalid
content in cases where we can't easily fix it, e.g., duplicate id's.
If we switch to HTML 5, the W3C validator will begin outputting errors
on this presentational stuff, which should hopefully encourage users
to reduce it over time, at least in high-profile places like the front
page or infoboxes.
So converting to HTML 5 would be trivial. However, in addition to
lending our support to good standards, there are several modest
practical benefits that would accrue from the switch. I include here
only things that are possible in valid HTML 5 documents, but which
would not validate as XHTML 1 (so excluding stuff like localStorage);
and which are usable right now (so excluding stuff like <nav>, <input
type=color>, etc.):
* HTML 5 permits omission of a lot of the cruft that XHTML requires.
It permits leaving off end tags in most cases where that's
unambiguous, and omitting some formerly required tags entirely (such as
<html>, <head>, and <body>, if they have no attributes). The "/>"
ending is no longer required. Superfluous attributes like
type=text/javascript on <script> are no longer needed (unless you want
to use <script type=application/x-python> or something, of course!).
Quotes may be omitted from attributes in almost all cases. The
doctype is shorter and easy to remember, and there is no xmlns
attribute. For an example of how compact valid HTML 5 can be, look at
the source of http://aryeh.name/. I once did a crude test and found
we could cut 5% or so off the length of our HTML by doing this --
*after* gzipping. Not only does this make our code smaller, it will
also make it easier to read.
* We could support <video>/<audio> on conformant user agents without
the use of JavaScript. There's no reason we should need JS for
Firefox 3.5, Chrome 3, etc.
* We can use data-* attributes to store custom data for scripts. This
came up in the HTML diff work: its author stored some data for scripts
in custom attributes, which caused XHTML 1 validation to fail.
* We can use HTML 5 form attributes. These will enhance the
experience of users of appropriate browsers, and do nothing for
others. At least Opera 9.6x already supports almost all HTML 5 form
attributes. (Source:
<http://www.opera.com/docs/specs/presto211/forms/>) We could, for
instance, give required fields the "required" attribute, which will
cause the browser to prevent the form submission and notify the user
if they aren't filled in, without needing either JavaScript or a
server-side check. The "pattern" attribute even allows requiring that
the input match a regex, and this is also supported by Opera 9.6x.
See <http://dev.w3.org/html5/spec/Overview.html#common-input-element-attributes>.
* There are a couple of parser tests that currently fail because of
misnested tags. If we altered the parser to no longer output any </p>
tags (which HTML 5 permits), these tests would immediately pass. It
doesn't look like anyone's going to fix them otherwise.
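To make a few of the items above concrete, here is a sketch in HTML 5. Every file name, attribute value, and form-field name below is made up for illustration; only the element and attribute names come from the spec:

```html
<!-- A complete, valid HTML 5 document: no <html>, <head>, or <body>
     tags, no quotes around the class attribute, no closing </p>. -->
<!doctype html>
<title>Compact example</title>
<p class=demo>Hello, world.

<!-- Native video, with fallback content for older browsers (where
     the existing JavaScript player could go): -->
<video src="Example_video.ogg" controls>
  Your browser does not support the video element.
</video>

<!-- Custom data for scripts, now valid via data-* attributes: -->
<td class="diff-context" data-diff-pos="42">...</td>

<!-- Declarative form validation with required and pattern: -->
<form action="/w/index.php" method="post">
  <input name="wpSummary" required>
  <input name="wpYear" pattern="[0-9]{4}" title="a four-digit year">
  <input type="submit" value="Save">
</form>
```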
These are only a few of the things that have immediate concrete
benefit. There are probably more I couldn't find immediately (HTML 5
is a huge spec), and of course in the long term there's an incredible
amount that would be invaluable to us.
I propose the following migration plan:
1) Fix the doctype, Content-Style-Type, and name attributes. We can
then officially claim we're shipping HTML 5! :) (Albeit maybe
invalid in some cases.) Also remove any unnecessary attributes and
elements, without breaking XML well-formedness. Begin using HTML 5
form attributes and any other useful features. Poke the Cortado
people about letting <video> work without JavaScript.
2) Once this goes live, if no problems arise, try causing an XML
well-formedness error. For instance, remove the quote marks around
one attribute of an element that's included in every page. I suggest
this as a separate step because I suspect there are some bot operators
who are doing screen-scraping using XML libraries, so it would be a
good idea to assess how feasible it is at the present time to stop
being well-formed. In the long run, of course, those bot operators
should switch to using the API. If we receive enough complaints once
this goes live, we can revert it and continue to ship HTML 5 that's
also well-formed XML, for the time being.
3) If XML well-formedness is not a problem, get rid of all unneeded
closing tags, quotation marks, self-closing "/>" constructs, etc.
Create an Html class like Xml, which will generate elements in the
nice compact form that HTML 5 permits, and phase out use of Xml in
favor of Html. (Xml has long since ceased to be purely about XML
anyway.)
So, what are people's thoughts?
I propose that an additional checksum of the revision text be added to
the MediaWiki database and that this checksum be made available via the
database dumps and API calls.
This additional field would allow many computations such as revert and
noop detection without having to ask the system to provide the full text
of revisions. For example, if I were to build a user script to show
users which revisions have been reverted, it would be beneficial to not
have to ask the API for the full text of a large list of revisions. On
that same note, even when I need the full text of revisions, I could
determine which revisions I do not need to request by determining that
their content is exactly the same as one that has already been retrieved.
It does not seem that such a field would require considerably more
storage or computational power: computing an MD5 checksum in PHP is
cheap, and storing 32 hex characters is negligible compared to the size
of an article's text.
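For example, revert and no-op detection with such a checksum could look like this (Python used for illustration, with hashlib standing in for PHP's md5(); the revision texts are invented):

```python
import hashlib

def rev_checksum(text):
    """Return the 32-hex-character MD5 checksum of a revision's text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def find_reverts(revisions):
    """Given revision texts in chronological order, flag revisions whose
    checksum matches an earlier revision (revert/no-op candidates),
    without ever comparing full texts."""
    seen = {}      # checksum -> index of first revision with that text
    reverts = []
    for i, text in enumerate(revisions):
        digest = rev_checksum(text)
        if digest in seen:
            reverts.append((i, seen[digest]))  # (revision, reverted-to)
        else:
            seen[digest] = i
    return reverts

history = ["stub", "stub + vandalism", "stub"]
print(find_reverts(history))  # the third edit restores the first: [(2, 0)]
```

With the checksums already in the dump or API response, only the comparison step is needed client-side; no revision text has to be fetched twice.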
Thanks,
-Aaron Halfaker
Who is in charge of the Wikimedia Technical Blog?
Consider adding a "contact" link on your blog.
However, make sure it doesn't go through the same filter as the
following problem:
After each comment I post, logged in or not, it says:
Your location has been identified as part of a reported spam network. Comments have been disabled to prevent spam.
Yesterday from 122.127.*, Today from 218.163.*
https://bugzilla.wikimedia.org/show_bug.cgi?id=19540
http://lists.wikimedia.org/pipermail/wikitech-l/2009-July/043834.html
Hi!
an interesting fact from operations.. ;-) a template change on
commons.wikimedia.org grew the database by 50% in a week or so,
pushing all s3 servers close to 100% disk use... *ouch*
Domas
Sorry if this is not the correct forum to report a possible bug.
I noticed browsing the "Example of second-order singular perturbation
theory" section of the "Perturbation Theory" page on wikipedia:
---
Consider the following equation for the unknown variable <math>x</math>:
:<math>x=1+\epsilon x^5.</math>
For the initial problem with <math>\epsilon=0</math>, the solution is
<math>x_0=1</math>. For small <math>\epsilon</math> the lowest order
approximation may be found by inserting the [[ansatz]]
:<math>x=x_0+\epsilon x_1 (+\cdots)</math>
---
that the fonts used for "x" in the :<math> sections render
differently. In particular, on my Linux box there are no serifs on the
x's in the first equation and in x_0=1, while on the others there are
small serifs. I don't see why they should render differently; the
source markup certainly doesn't indicate anything different to me...
I've tested this on:
IE7.0 Windows
Safari 4.0 on OSX 10.5
Firefox 3.5 on OSX 10.5
Firefox 3.0.10 on Linux 2.6.18-128.1.10.el5
I've attached a small screenshot of what I see.
M
I don't want to irritate people by asking inappropriate questions on this list. So please direct me to the right list if this is the wrong one for this question.
I ran parserTests and 45 tests failed. The result was:
Passed 559 of 604 tests (92.55%)... 45 tests failed!
I expect this indicates a problem, but sometimes test suites are set up so certain tests fail. Is this result good or bad?
Dan
I have been struggling to figure out how to run the parser tests. From the very limited documentation in the code, it appears you are supposed to run them from a terminal. However, when I cd to the maintenance directory and type "php parserTests.php" I get the following error message.
Parse error: parse error, expecting `T_OLD_FUNCTION' or `T_FUNCTION' or `T_VAR' or `'}'' in /Users/dnessett/Sites/Mediawiki/maintenance/parserTests.inc on line 43
Either there is some setup necessary that I haven't done, parserTests.php is not the appropriate "top-level" target for the execution, you are not supposed to run these tests from the terminal, or there is something else I am doing wrong.
I tried to find documentation on how to run the tests without success. If I simply haven't looked in the right place, a quick pointer to the appropriate instructions would be great. Otherwise, I wonder if someone could instruct me how to run them.
Thanks,
Dan
--- On Fri, 7/10/09, Aryeh Gregor <Simetrical+wikilist(a)gmail.com> wrote:
> From: Aryeh Gregor <Simetrical+wikilist(a)gmail.com>
> Subject: Re: [Wikitech-l] How do you run the parserTests?
> To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
> Date: Friday, July 10, 2009, 3:01 PM
> On Fri, Jul 10, 2009 at 5:46 PM, dan nessett <dnessett(a)yahoo.com> wrote:
> > MediaWiki 1.14.0
> > PHP 5.2.6 (apache2handler)
> > MySQL 5.0.41
> >
> > ...
> > private $color;
>
> I'm going to bet that php on your command line is actually PHP 4, not
> PHP 5. Try php -v to check this. Using "php5 parserTests.php" might
> work.
>
Good guess. It appears I have both php4 and php5 installed.
Thanks.
Dan