Philip Jägenstedt wrote:
I don't suppose that the members of this list appreciate the epic Microdata vs. RDFa battle leaking into this mailing list
I wouldn't use such terms to frame the debate. The Microformats, Microdata and RDFa communities are not "battling" or working against each other - they're having a very necessary, spirited debate. Clearly, each community is influencing the designs of the others, and clearly we need to have these discussions in order to make sure that we're creating the best possible technology for the future of the Web.
More importantly, the reason that all of us are working on this technology is because we care about how it is used to better humanity. At least, I hope that's why people are working on this stuff :). Certainly, we all hold Wikipedia in high regard and want what's best for this community as well.
It's not /unfortunate/ that we're having the discussion here - it was inevitable.
I'm delighted by the fact that we're even having this debate. It took ages to convince the WHATWG that this was a problem that needed to be addressed[1] just 18 months ago.
So, we can either grit our teeth and begrudgingly go through the motions, or we can welcome the debate to come.
I choose to do the latter because I know that all of us will learn something from it and better understand the requirements for Wikimedia implementations. What we learn here will further influence guidance given to future communities, just as integrating RDFa with Drupal has influenced the advice that we may give to this community.
[ed: Microdata] is really quite intuitive and simple, with few surprises.
I agree on the first point - Microdata is pretty intuitive and simple, with few surprises. Although, I'd say the same for RDFa as well. I think we tend to forget, though, that Web semantics require a bit of effort to learn and the audience that is using the technology should be taken into account when deciding how to expose an authoring environment for the community.
I don't think that the best approach for Wikipedia is to allow direct Microdata or RDFa markup. There are already many templates in use at Wikipedia via Infobox - those templates could be leveraged to automatically generate RDFa in the same way that dbpedia.org uses those templates to generate RDF. The risk this community runs by allowing arbitrary semantic data markup is that contributors make mistakes causing half of the semantic data to be corrupted - making the rest of the data useless.
That said, neither Microdata nor RDFa is free of surprises for the beginner. Like all new web technologies, there is a learning curve for both of them, and it's pretty similar since Microdata's design was influenced by RDFa and Microformats. More about the surprises with each, below.
[ed: Microdata] maps well to the RDF model if you want it, but doesn't force authors to think in terms of subject, predicate, object triples.
Well, Microdata /almost/ maps to the RDF model. Microdata doesn't support RDF literal typing, which is basically a fancy way of saying that you can't verify that weights, volumes, speeds, the full range of dates in different calendars, encodings such as chemical compositions, and various other kinds of typed information are expressed cleanly by the Wikipedia contributors.
So, if you wanted to say something like this:
The speed of light is 299792458 m/s.
You would do this in RDFa:
<div about="#light">
   The speed of light is
   <span property="measure:speed"
         datatype="measure:meters-per-second">299792458</span> m/s.
</div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
Some of you may be asking yourselves "Why is that so important?". The primary concern has to do with data validation. Good RDF vocabularies are built to be able to validate their data and this is important for large sites like Wikipedia to ensure that the data that they're exposing is valid. Since measure:speed's range is measure:meters-per-second, and meters-per-second is presumably a sub-class of xsd:decimal, then a data validator would know that it's expecting some sort of number. So, if a Wikipedia author enters some markup that generates this data:
<#baseball> measure:speed "fast enough to hurt" .
An RDF reasoner would know that not only is the data not typed, but even if it were typed, the value "fast enough to hurt" is not valid. I would expect that this most basic level of data validation would be important to Wikipedia as you want to make sure that contributors are being careful with their markup.
The above is how you would do it in RDFa. Philip, I haven't seen any work related to this in Microdata - have there been any recent developments with regard to data validation in Microdata?
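To make the validation argument concrete, here is a minimal Python sketch of the kind of check a validator could run. It is not an RDF reasoner. The measure: terms are the made-up ones from my example, and the EXPECTED_RANGE table, the Triple type and validate() are hypothetical names invented purely for illustration. It also assumes, as I did above, that meters-per-second is derived from xsd:decimal.

```python
from decimal import Decimal, InvalidOperation
from typing import NamedTuple, Optional

# Hypothetical vocabulary table: property -> expected datatype (its "range").
# The measure: terms come from the example above, not from a published vocabulary.
EXPECTED_RANGE = {"measure:speed": "measure:meters-per-second"}

# Assumption: these datatypes are numeric (derived from xsd:decimal).
NUMERIC_DATATYPES = {"measure:meters-per-second", "xsd:decimal"}


class Triple(NamedTuple):
    subject: str
    predicate: str
    value: str
    datatype: Optional[str] = None  # None means a plain, untyped literal


def validate(triple):
    """Return a list of problems found; an empty list means the triple passed."""
    problems = []
    expected = EXPECTED_RANGE.get(triple.predicate)
    if expected is None:
        return problems  # unknown property: nothing to check against
    if triple.datatype is None:
        problems.append("literal is untyped, expected " + expected)
    elif triple.datatype != expected:
        problems.append("datatype is %s, expected %s" % (triple.datatype, expected))
    if expected in NUMERIC_DATATYPES:
        try:
            Decimal(triple.value)
        except InvalidOperation:
            problems.append("value %r is not a number" % triple.value)
    return problems


good = Triple("#light", "measure:speed", "299792458", "measure:meters-per-second")
bad = Triple("#baseball", "measure:speed", "fast enough to hurt")

for triple in (good, bad):
    print(triple.subject, validate(triple) or "ok")
```

The typed speed-of-light triple passes, while the baseball triple is flagged twice: once for the missing datatype and once for the non-numeric value. A real deployment would read the ranges from the vocabulary document instead of a hard-coded table.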
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
No, as long as an item type is used (http://n.whatwg.org/work) there is no ambiguity. This particular item type is defined at http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#l...
Title here "Gives the name of the work." without ambiguity.
This is new! I'm glad this issue was addressed in Microdata as it was one of my criticisms of it when I last read the Microdata spec about six months ago. Looks like that section of the spec was last changed on October 23rd 2009? Do you know when this was put in there, Philip?
What happens when an author forgets to include itemtype? So, if somebody does this:
<div itemscope> <span itemprop="title">Emery Molyneux Terrestrial Globe</span> </div>
There's nothing to ground the "title" property. The way I'm reading the spec, it becomes ambiguous at that point, right?
RDFa is very careful to never let something like this happen... as this data ambiguity results in questionable data that you wouldn't want to pass to a reasoning agent.
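If that reading of the spec is right, this is at least something a wiki could lint for. The following Python sketch flags itemprop names that sit inside an itemscope with no itemtype. UngroundedItemFinder and find_ungrounded are hypothetical names, and the code is a deliberately simplified illustration, not an implementation of the Microdata extraction algorithm: it tracks nesting by tag depth only, ignores itemref, and assumes every element is explicitly closed.

```python
from html.parser import HTMLParser


class UngroundedItemFinder(HTMLParser):
    """Collect itemprop names that appear inside an itemscope lacking itemtype."""

    def __init__(self):
        super().__init__()
        self.scopes = []      # stack of (depth, has_itemtype) for open items
        self.depth = 0        # current element nesting depth
        self.ungrounded = []  # property names with no vocabulary to anchor them

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        attrs = dict(attrs)
        # An itemprop belongs to the nearest enclosing item, so check it
        # before pushing any new item that starts on this same element.
        if "itemprop" in attrs and self.scopes and not self.scopes[-1][1]:
            self.ungrounded.extend((attrs["itemprop"] or "").split())
        if "itemscope" in attrs:
            self.scopes.append((self.depth, "itemtype" in attrs))

    def handle_endtag(self, tag):
        # Leaving the element that opened the innermost item closes that item.
        if self.scopes and self.scopes[-1][0] == self.depth:
            self.scopes.pop()
        self.depth -= 1


def find_ungrounded(html):
    finder = UngroundedItemFinder()
    finder.feed(html)
    finder.close()
    return finder.ungrounded


snippet = ('<div itemscope> <span itemprop="title">'
           'Emery Molyneux Terrestrial Globe</span> </div>')
print(find_ungrounded(snippet))
```

Run against the snippet above it reports the "title" property; add itemtype="http://n.whatwg.org/work" to the div and the list comes back empty.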
Furthermore, for this particular vocabulary the mapping to RDF is defined, as such:
title: http://purl.org/dc/elements/1.1/title
author: http://creativecommons.org/ns#attributionName
license: http://www.w3.org/1999/xhtml/vocab#license
In other words you express the exact same information as with RDFa but without the mental overhead of triples or mixing multiple vocabularies.
... and with the added danger of expressing ambiguous data. This is not the real danger, though. While data ambiguity is really bad when it comes to data stores, centralized vocabulary management is even worse.
RDFa is built on a concept called "follow your nose", which means that all vocabulary term URLs in RDFa, such as http://purl.org/media/audio#Recording, should be dereference-able and at the end of that URL should be a machine-readable description of the vocabulary term. Preferably, a human-readable description should also exist at that URL.
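For those who haven't seen "follow your nose" in practice, here is a rough Python sketch of what a consumer does with a term URL. The function names and the Accept header's preference order are my own assumptions for illustration; nothing in RDFa mandates them.

```python
import urllib.error
import urllib.request

# Assumed preference order of formats; RDFa itself does not mandate one.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9, text/html;q=0.5"


def build_vocab_request(term_url):
    """Build a GET request for a vocabulary term, asking for RDF first."""
    # The fragment (e.g. #Recording) is never sent to the server, so drop it.
    document_url = term_url.split("#", 1)[0]
    return urllib.request.Request(document_url, headers={"Accept": RDF_ACCEPT})


def dereference(term_url, timeout=10):
    """Return (status, content_type) for the term's document; needs network access."""
    request = build_vocab_request(term_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; report them the same way.
        return error.code, error.headers.get("Content-Type")


request = build_vocab_request("http://purl.org/media/audio#Recording")
print(request.full_url, request.get_header("Accept"))
```

Calling dereference() on a term URL returns the HTTP status and content type, which is enough to tell a vocabulary that describes itself from one that does not.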
Dereference http://n.whatwg.org/work and you get a 404 Error. Even worse, the Microdata work vocabulary is hard-coded in the HTML5 specification. If one wanted to extend the vocabulary, you would have to convince the only editor of that specification, who has a track record of being both very easy and very difficult to work with (based on whether or not he agrees with you), that your vocabulary term warrants addition.
There are currently 3 Microdata vocabularies in the spec[2].
To contrast, there are over 250 active RDF vocabularies[3].
That is the true power of decentralized vocabulary development, which is a cornerstone of RDFa. The RDFa community understands that Wikipedia should be in charge of choosing and extending vocabularies since this community has the appropriate domain experts. You are the experts, we are not - and it's important to recognize that in the design of any semantic data expression language.
If Wikipedia agrees that embedding semantics in their pages is of worth to humanity (and I certainly think it is of great worth), then there will come a time that this community will want to develop their own vocabulary. RDFa allows that vocabulary to be developed independently of any standards body and allows this community to have full control of it.
Sure, you could make the argument that Microdata allows RDF to be expressed (as long as you use the complete vocabulary URL), but at that point the Microdata markup is far more cumbersome than the RDFa markup. Similarly, if the goal is to express RDF, that is what RDFa was designed to accomplish.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
Certainly, but if wiki editors are *able* to do it by hand, then IMHO microdata is much less error-prone.
IMHO, there are ways to shoot yourself in the foot with both Microdata and RDFa - as I've outlined above. I suppose that you could use both and pick which foot you're going to shoot with which technology :), but my suggestion is that nobody should be making such generalized statements - that one is more error-prone than the other.
It's like saying that programming in Python is more error-prone than programming in PHP - it depends entirely on the skill of the developer, what you're doing, and many other factors that are out of the hands of language designers.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe
Is Wikipedia using XHTML served as application/xhtml+xml? It seems that RDFa in "XHTML" as deployed only works because consumers pretend that the data is XHTML even though it is served as text/html and treated as such by browsers. I would assume that most pages using RDFa today are neither valid XHTML, nor served with the XHTML MIME type. Any attempts to use browser DOM APIs to access the data will have surprising/confusing results, as HTML doesn't have namespaces but RDFa uses the syntax.
Frankly, this is something that nobody that uses this technology cares about because all they are ever going to see are key-value pairs (Microdata) or triples (RDFa).
This is something that only concerns browser manufacturers and RDFa parser writers. That's why there is a Microdata API, and why there is going to be an RDFa API. There are also many RDFa parser implementations that abstract this low-level stuff away.
Both Microdata and RDFa are being designed to operate in "dirty" environments with invalid markup and will work regardless of the MIME type, file extension, markup botching and namespace support across websites and web browsers.
There are a number of RDFa JavaScript implementations that work just[4] fine[5] on badly authored/served XHTML documents.
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&...
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_Nati...
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shune...
The migration to XHTML+RDFa would only require the DOCTYPE to change... which shouldn't be any more difficult than transitioning to HTML5 (or HTML5+RDFa) in the future.
Finally I will note that it is very likely that the microdata DOM APIs will get implemented in browsers, making the semantic data available to both scrapers, to native browser interfaces and to browser extensions such as user JavaScript. As an example, you might see an icon in the address bar for saving events to a calendar, or the license information of an image displayed in the native properties dialog. I stress again that I don't make any promises on behalf of Opera or any other browser vendor, these are just my predictions.
Again, this is exciting news and while I don't think Microdata is the proper solution for the Web, for the same reasons that are outlined above and many more, I'm delighted to hear that Opera is taking in-browser semantic data expression very seriously. How far we have come in just 18 months! :)
-- manu
[1] http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-August/015971.html
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#m...
[3] http://prefix.cc/popular/all
[4] http://code.google.com/p/rdfquery/
[5] http://code.google.com/p/ubiquity-rdfa/