Duesentrieb checked in RDFa support for MediaWiki in r58712:
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/58712
I discussed this with him at some length, and Tim commented on how it ties into the parser. I'd like to discuss this a bit more broadly because we're talking about extending wikitext -- whatever markup we allow on Wikipedia (and in this case, particularly on Commons) at the next scap is probably going to have to be allowed forever by default in MediaWiki, because everyone will start using it and pages will break if we disable it.
RDFa is a way to embed data in HTML more robustly than with attributes like class and title, which are reserved for author use or have existing functionality. It allows you to specify an external vocabulary that adds some semantics to your page that HTML is not capable of expressing by itself. RDFa is based on the RDF standard, and is relatively old. Microdata is a newer competing standard, created last year as part of HTML5, that aims to be much simpler to use.
The major use case we have is marking up Commons image licenses. Either RDFa or Microdata could allow machines to more easily tell what licenses the images we use are under. But in the long term, it seems likely that only one of these technologies will win, and the other will die. We don't want to have to support the loser forever. So IMO we should choose the better one and go with that alone.
Now, which to choose? RDFa is better-established, and the W3C is still attached to it, but Microdata has much greater support among the parties that matter, including Google, Mozilla, Apple, and Opera (as judged from discussions in the WHATWG and W3C). It's a lot more concise and simpler to use, is better integrated into HTML, and can represent any semantics we'd want. At the bottom of this post is an example exhibiting how much simpler microdata is. Both RDFa+HTML and Microdata are Working Drafts at the W3C right now, although RDFa in XHTML1 (which we won't be using for much longer) is a Recommendation.
I should note that currently Google and a couple of others support RDFa but not Microdata. But come on -- we're Wikipedia. Google already screen-scrapes our templates to figure out what licenses we use anyway; parsing microdata has got to be easier. We shouldn't let existing market shares deter us from picking the better technology. My personal opinion on this is that we should enable Microdata by default (which is much less intrusive than enabling RDFa -- just whitelist a few extra attributes) and encourage Commons to use that instead of RDFa. We can leave RDFa support in, but disabled by default. What does everyone else think?
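To make "whitelist a few extra attributes" concrete, here is a rough sketch in Python. It is not MediaWiki's actual sanitizer code (which is PHP); the function name `filter_attributes` and the `BASE_ATTRS` set are made up for illustration, and the microdata attribute names are the ones the Microdata draft defines as far as I understand it.

```python
# Hypothetical attribute filter; a sketch only, not MediaWiki's real sanitizer.

# Attributes assumed (for illustration) to be allowed on wikitext elements already.
BASE_ATTRS = {"id", "class", "title", "lang", "dir", "style"}

# The extra attributes Microdata needs, per my reading of the draft.
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemid", "itemprop", "itemref"}


def filter_attributes(attrs, allow_microdata=True):
    """Return only the attributes that the whitelist permits."""
    allowed = (BASE_ATTRS | MICRODATA_ATTRS) if allow_microdata else BASE_ATTRS
    return {name: value for name, value in attrs.items() if name.lower() in allowed}


if __name__ == "__main__":
    sample = {"itemprop": "title", "class": "fn", "onclick": "evil()"}
    print(filter_attributes(sample))
    print(filter_attributes(sample, allow_microdata=False))
```

The point is only that enabling Microdata amounts to enlarging one set of allowed names; no new tags or namespace handling are involved.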
== Example of RDFa vs. Microdata ==

Suppose we have the following markup right now:
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p>EmeryMolyneux-terrestrialglobe-1592-20061127.jpg by Bob Smith is licensed under a <a href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
Sample RDFa code to say an image is under a CC-BY-SA 3.0 license seems to be something like this, based off the license generator on the CC website:
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" id="mw-image"> ... <p><span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
This adds an id to the image, rel="license" to the license link, and two extra tags with lots of lengthy attributes. To be valid RDFa, we would need to add further markup somewhere, at least a version attribute on the <html> tag on every page AFAIK. Equivalent microdata is this:
[[ <div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work"> ... <p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
This adds two attributes to an ancestor to indicate that the contents form a work -- these could be moved to lower elements if desired, AFAICT, but then they'd have to be duplicated. Instead of adding an id to the <img>, it uses itemprop="work" to directly say it's the work being referred to. Instead of <span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">, we have <span itemprop="title">. Instead of <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">, we have <span itemprop="author">.
Overall, I think it's clear from this example that microdata is much more concise and also more coherent. It's easy to see from this example exactly how the microdata model works: you have a bunch of stuff grouped as an item using itemscope, itemtype tells you what type of item it is, and then itemprop tells you what role each piece has. It's barely longer than the un-annotated markup. RDFa, by contrast, is a mess of boilerplate that's impossible to understand unless you actually read the specs. Microdata's syntax has actually been refined by a usability study run on it by Google.
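To illustrate how little machinery a consumer needs for this model, here is a minimal sketch in Python that pulls the item type and properties out of the microdata example, using only the standard library's html.parser. The class name `MicrodataSketch` is made up, and this is a simplification under stated assumptions, not a conforming Microdata parser: it handles a single top-level item, ignores nested items and itemref, and only knows the URL-valued elements listed in `URL_ATTRS`.

```python
from html.parser import HTMLParser

# Elements whose property value is a URL attribute rather than text content
# (a subset of what the spec lists; enough for this example).
URL_ATTRS = {"a": "href", "area": "href", "link": "href", "img": "src"}
# Elements with no end tag, so they must never be pushed on the open stack.
VOID = {"img", "link", "meta", "br", "hr", "input", "area"}

SAMPLE = """
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work">
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span itemprop="author">Bob Smith</span> is licensed under a
<a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collects itemtype and itemprop values for one top-level item."""

    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.props = {}
        self._open = []  # stack of (tag, itemprop name or None, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and self.itemtype is None:
            self.itemtype = attrs.get("itemtype")
        name = attrs.get("itemprop")
        if name and tag in URL_ATTRS:
            # URL-valued property: take it from href/src, not from the text.
            self.props[name] = attrs.get(URL_ATTRS[tag])
            name = None
        if tag not in VOID:
            self._open.append((tag, name, []))

    def handle_data(self, data):
        # Text belongs to every open element that carries an itemprop.
        for _, name, parts in self._open:
            if name:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag not in [t for t, _, _ in self._open]:
            return  # stray end tag; ignore it
        while True:
            open_tag, name, parts = self._open.pop()
            if name:
                self.props[name] = "".join(parts)
            if open_tag == tag:
                break


parser = MicrodataSketch()
parser.feed(SAMPLE)
parser.close()
print(parser.itemtype)
print(parser.props)
```

The whole extraction is one pass over the tags with three attribute names to look for, which is roughly what I mean by the model being easy to follow.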
On Fri, Jan 15, 2010 at 10:47 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Sample RDFa code to say an image is under a CC-BY-SA 3.0 license seems to be something like this, based off the license generator on the CC website:
[[
<div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" id="mw-image"> ... <p><span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
It was pointed out in #whatwg on freenode that to be fair, I should leave off the fact that the work being pointed to is a still image (since Microdata does). On the other hand, the span needs to point to the actual URL of the image, not just an ID, so I *think* this is the markup I actually wanted:
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p><span xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
This is about as long as before, but it might still be wrong. The general points I made are still accurate, anyway.
Second, it was pointed out that the RDFa example here mixes two existing vocabularies, while the Microdata example uses a vocabulary specifically designed for our use-case. However, I think this is fair -- we'd likely use the standard applicable vocabularies in each case, and the Microdata vocabulary is simpler for our primary use-case.
Third, it was also pointed out that RDFa 1.1 is supposed to be simpler. But RDFa 1.1 probably has about the same deployment right now as Microdata, i.e., roughly none, so that gets rid of RDF's biggest advantage.
But in the end, personal opinion aside, Microdata looks like the technology with a future right now, for good reason. The consensus of almost everyone I've talked to who's not precommitted to RDF is that Microdata is the better technology. Since existing deployment isn't a huge issue for us given our size -- we'll become one of the biggest web users of whichever technology we choose -- I think we should go with Microdata as the apparently better solution, unless anyone has reasons not to.
wikitech-l@lists.wikimedia.org