RDFa and Microdata in MediaWiki - Wikitech-l

15 Jan 2010


Duesentrieb checked in RDFa support for MediaWiki in r58712:
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/58712
I discussed this with him at some length, and Tim commented on how it
ties into the parser.  I'd like to discuss this a bit more broadly
because we're talking about extending wikitext -- whatever markup we
allow on Wikipedia (and in this case, particularly on Commons) at the
next scap is probably going to have to be allowed forever by default
in MediaWiki, because everyone will start using it and pages will
break if we disable it.
RDFa is a way to embed data in HTML more robustly than with attributes
like class and title, which are reserved for author use or have
existing functionality.  It allows you to specify an external
vocabulary that adds some semantics to your page that HTML is not
capable of expressing by itself.  RDFa is based on the RDF standard,
and is relatively old.  Microdata is a new competing standard that was
created last year as part of HTML5, which aims to be much simpler to
use.
The major use case we have is marking up Commons image licenses.
Either RDFa or Microdata could allow machines to more easily tell what
licenses the images we use are under.  But in the long term, it seems
likely that only one of these technologies will win, and the other
will die.  We don't want to have to support the loser forever.  So IMO
we should choose the better one and go with that alone.
Now, which to choose?  RDFa is better-established, and the W3C is
still attached to it, but Microdata has much greater support among the
parties that matter, including Google, Mozilla, Apple, and Opera (as
judged from discussions in the WHATWG and W3C).  It's a lot more
concise and simpler to use, is better integrated into HTML, and can
represent any semantics we'd want.  At the bottom of this post is an
example exhibiting how much simpler microdata is.  Both RDFa+HTML and
Microdata are Working Drafts at the W3C right now, although RDFa in
XHTML1 (which we won't be using for much longer) is a Recommendation.
I should note that currently Google and a couple of others support
RDFa but not Microdata.  But come on -- we're Wikipedia.  Google
already screen-scrapes our templates to figure out what licenses we
use anyway, so parsing microdata has got to be easier.  We shouldn't let
existing market shares deter us from picking the better technology.
My personal opinion on this is that we should enable Microdata by
default (which is much less intrusive than enabling RDFa -- just
whitelist a few extra attributes) and encourage Commons to use that
instead of RDFa.  We can leave RDFa support in, but disabled by
default.  What does everyone else think?
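To make "whitelist a few extra attributes" concrete, here is a rough sketch in Python.  It is not MediaWiki's actual (PHP) sanitizer: the names BASE_ATTRS, MICRODATA_ATTRS and filter_attributes are made up for illustration, the base list is an arbitrary sample, and the Microdata list only covers the three attributes used in the example at the bottom of this post.

```python
# Hypothetical attribute whitelist; NOT MediaWiki's real sanitizer config.
BASE_ATTRS = {"id", "class", "title", "href", "src", "width", "height"}

# Only the Microdata attributes that appear in the example in this post.
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}


def filter_attributes(attrs, allow_microdata=True):
    """Return a copy of attrs containing only whitelisted attribute names."""
    if allow_microdata:
        allowed = BASE_ATTRS | MICRODATA_ATTRS
    else:
        allowed = BASE_ATTRS
    return {name: value for name, value in attrs.items()
            if name.lower() in allowed}
```

With allow_microdata=False the same call simply drops itemscope, itemtype and itemprop, which is roughly what "disabled by default" would amount to in this sketch.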
== Example of RDFa vs. Microdata ==
Suppose we have the following markup right now:
[[
<div id="bodyContent">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480">
...
<p>EmeryMolyneux-terrestrialglobe-1592-20061127.jpg by Bob Smith is
licensed under a <a
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]
Sample RDFa code to say an image is under a CC-BY-SA 3.0 license seems
to be something like this, based off the license generator on the CC
website:
[[
<div id="bodyContent">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480" id="mw-image">
...
<p><span xmlns:dc="http://purl.org/dc/elements/1.1/"
href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span>
is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]
This adds an id to the image, rel="license" to the license link, and
two extra tags with lots of lengthy attributes.  To be valid RDFa, we
would need to add further markup somewhere, at least a version attribute on
the <html> tag on every page AFAIK.  Equivalent microdata is this:
[[
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480" itemprop="work">
...
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span itemprop="author">Bob Smith</span> is licensed under a <a
itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]
This adds two attributes to an ancestor to indicate that the contents
form a work -- these could be moved to lower elements if desired,
AFAICT, but then they'd have to be duplicated.  Instead of adding an
id to the <img>, it uses itemprop="work" to directly say it's the work
being referred to.  Instead of <span
xmlns:dc="http://purl.org/dc/elements/1.1/"
href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
rel="dc:type">, we have <span itemprop="title">.  Instead of <span
xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
property="cc:attributionName" rel="cc:attributionURL">, we have <span
itemprop="author">.
Overall, I think it's clear from this example that microdata is much
more concise and also more coherent.  It's easy to see from this
example exactly how the microdata model works: you have a bunch of
stuff grouped as an item using itemscope, itemtype tells you what type
of item it is, and then itemprop tells you what role each piece
has.  The result is barely longer than the un-annotated markup.  RDFa, by
contrast, is a mess of boilerplate that's impossible to understand
unless you actually read the specs.  Microdata's syntax has actually
been refined by a usability study run on it by Google.
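To back up the claim that the model is easy to consume, here is a minimal sketch of a reader for the microdata example above, using Python's standard html.parser.  It is an illustration under assumptions, not a conforming Microdata implementation: it handles a single non-nested item, it takes the value from href on <a> and src on <img> and from the text content otherwise, and the names MicrodataSketch and extract_item are invented for this post.

```python
from html.parser import HTMLParser

# Elements with no end tag, so they are never pushed on the open-element stack.
VOID_ELEMENTS = {"img", "br", "hr", "input", "meta", "link"}

# Elements whose property value is a URL attribute rather than text content.
URL_ATTRS = {"a": "href", "img": "src"}


class MicrodataSketch(HTMLParser):
    """Collects itemtype and itemprop values for one item (no nesting)."""

    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.properties = {}
        self._stack = []  # entries: (tag, itemprop name or None, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.itemtype = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop and tag in URL_ATTRS:
            # URL-valued property: read it straight from the attribute.
            self.properties[prop] = attrs.get(URL_ATTRS[tag])
            prop = None
        if tag not in VOID_ELEMENTS:
            self._stack.append((tag, prop, []))

    def handle_data(self, data):
        # Text belongs to every open element that carries an itemprop.
        for _tag, prop, parts in self._stack:
            if prop:
                parts.append(data)

    def handle_endtag(self, tag):
        if not any(entry[0] == tag for entry in self._stack):
            return  # stray end tag; ignore it
        while self._stack:
            open_tag, prop, parts = self._stack.pop()
            if prop:
                # Collapse runs of whitespace, including line breaks.
                self.properties[prop] = " ".join("".join(parts).split())
            if open_tag == tag:
                break


def extract_item(html):
    """Return (itemtype, {property: value}) for the item found in html."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.itemtype, parser.properties
```

Fed the microdata markup from this post, extract_item should report the item type http://n.whatwg.org/work along with the work, title, author and license properties.  No prefix or namespace resolution is needed at any point, which is the step an RDFa consumer would additionally have to perform.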