Philip Jägenstedt wrote:
> I don't suppose that the members of this list appreciate the epic
> Microdata vs. RDFa battle leaking into this mailing list
I wouldn't use such terms to frame the debate. The Microformats,
Microdata and RDFa communities are not "battling" or working against
each other - they're having a very necessary, spirited debate. Clearly,
both communities are influencing the design of the other and clearly we
need to have these discussions in order to make sure that we're creating
the best possible technology for the future of the Web.
More importantly, the reason that all of us are working on this
technology is because we care about how it is used to better humanity.
At least, I hope that's why people are working on this stuff :).
Certainly, we all hold Wikipedia in high regard and want what's best for
this community as well.
It's not /unfortunate/ that we're having the discussion here - it was
inevitable.
I'm delighted by the fact that we're even having this debate. It took
ages to convince the WHATWG that this was a problem that needed to be
addressed[1] just 18 months ago.
So, we can either grit our teeth and begrudgingly go through the
motions, or we can welcome the debate to come.
I choose to do the latter because I know that all of us will learn
something from it and better understand the requirements for Wikimedia
implementations. What we learn here will further influence guidance
given to future communities, just as integrating RDFa with Drupal has
influenced the advice that we may give to this community.
> [ed: Microdata] is really quite intuitive and simple, with few
> surprises.
I agree on the first point - Microdata is pretty intuitive and simple,
with few surprises. Although, I'd say the same for RDFa as well. I think
we tend to forget, though, that Web semantics require a bit of effort to
learn and the audience that is using the technology should be taken into
account when deciding how to expose an authoring environment for the
community.
I don't think that the best approach for Wikipedia is to allow direct
Microdata or RDFa markup. There are already many templates in use at
Wikipedia via Infobox - those templates could be leveraged to
automatically generate RDFa in the same way that dbpedia.org uses those
templates to generate RDF. The risk this community runs by allowing
arbitrary semantic data markup is that contributors make mistakes
causing half of the semantic data to be corrupted - making the rest of
the data useless.
That said, neither Microdata nor RDFa is free of surprises for the beginner.
Like all new web technologies, there is a learning curve for both of
them and it's pretty similar since Microdata's design was influenced by
RDFa and Microformats. More about the surprises with each, below.
> [ed: Microdata] maps well to the RDF model if you want it, but doesn't
> force authors to think in terms of subject, predicate, object triples.
Well, Microdata /almost/ maps to the RDF model. Microdata doesn't
support RDF literal typing, which is basically a fancy way of saying
that you can't verify that weights, volumes, speeds, the full range of
dates in different calendars, encodings such as chemical compositions,
and various other kinds of typed information are expressed cleanly by the
Wikipedia contributors.
So, if you wanted to say something like this:
The speed of light is 299792458 m/s.
You would do this in RDFa:
  <div about="#light">
    The speed of light is <span property="measure:speed"
      datatype="measure:meters-per-second">299792458</span> m/s.
  </div>
which would generate the following triple:
  <#light>
    measure:speed
      "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there, Philip?
Some of you may be asking yourselves "Why is that so important?". The
primary concern has to do with data validation. Good RDF vocabularies
are built to be able to validate their data and this is important for
large sites like Wikipedia to ensure that the data that they're exposing
is valid. Since measure:speed's range is measure:meters-per-second, and
meters-per-second is presumably a sub-class of xsd:decimal, then a data
validator would know that it's expecting some sort of number. So, if a
Wikipedia author enters some markup that generates this data:
  <#baseball>
    measure:speed
      "fast enough to hurt" .
An RDF reasoner would know that not only is the data not typed, but even
if it were typed, the value "fast enough to hurt" is not valid. I would
expect that this most basic level of data validation would be important
to Wikipedia as you want to make sure that contributors are being
careful with their markup.
The above is how you would do it in RDFa. Philip, I haven't seen any
work related to this in Microdata - have there been any recent
developments with regard to data validation in Microdata?
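To make the validation step described above a little more concrete, here is a rough Python sketch of the kind of check I have in mind. It is not how any particular RDF reasoner works: the example.org URIs, the RANGES table and the validate() helper are all made up for illustration, and I'm assuming that meters-per-second can be treated as a plain decimal.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical URIs: the "measure:" prefix used in this message is not bound
# to a real namespace, so example.org stands in for it here.
SPEED = "http://example.org/measure#speed"
METERS_PER_SECOND = "http://example.org/measure#meters-per-second"

# Assumed range declarations: predicate -> datatype its values must carry.
RANGES = {SPEED: METERS_PER_SECOND}


def _is_decimal(lexical):
    """True if the lexical form parses as a finite decimal number."""
    try:
        return Decimal(lexical).is_finite()
    except InvalidOperation:
        return False


# Assumed lexical checks: datatype -> function that accepts valid values.
LEXICAL_CHECKS = {METERS_PER_SECOND: _is_decimal}


def validate(predicate, lexical, datatype=None):
    """Return a list of problems found for one literal-valued triple."""
    problems = []
    expected = RANGES.get(predicate)
    if expected is None:
        return problems  # nothing is known about this predicate
    if datatype is None:
        problems.append("literal is untyped")
    elif datatype != expected:
        problems.append("unexpected datatype")
    if not LEXICAL_CHECKS[expected](lexical):
        problems.append("value is not valid for the expected datatype")
    return problems
```

Calling validate(SPEED, "299792458", METERS_PER_SECOND) reports no problems, while validate(SPEED, "fast enough to hurt") reports both that the literal is untyped and that the value is not a valid number.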
>> So, we get more-or-less the same number of data items out, but there is
>> a problem. What does "title" mean in the semantic sense? Does it mean
>> "job title" or does it mean "work title"? The term "title" in this case
>> is ambiguous.
> No, as long as an item type is used (http://n.whatwg.org/work) there
> is no ambiguity. This particular item type is defined at
> http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#…
> Title here "Gives the name of the work." without ambiguity.
This is new! I'm glad this issue was addressed in Microdata as it was
one of my criticisms of it when I last read the Microdata spec about six
months ago. Looks like that section of the spec was last changed on
October 23rd 2009? Do you know when this was put in there, Philip?
What happens when an author forgets to include itemtype? So, if somebody
does this:
  <div itemscope>
    <span itemprop="title">Emery Molyneux Terrestrial Globe</span>
  </div>
There's nothing to ground the "title" property. The way I'm reading the
spec, it becomes ambiguous at that point, right?
RDFa is very careful to never let something like this happen... as this
data ambiguity results in questionable data that you wouldn't want to
pass to a reasoning agent.
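For what it's worth, a missing itemtype like the one above is easy to detect mechanically. The following is a rough Python sketch, not an implementation of the Microdata parsing algorithm: it ignores itemref, assumes reasonably well-nested markup, treats any property name containing a colon as a full URL, and ungrounded_properties() is a name I made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not tracked on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class _ItemChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        # One entry per open element: None for a plain element, "" for an
        # item without itemtype, or the itemtype string for a typed item.
        self.stack = []
        self.ungrounded = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The nearest enclosing item owns any itemprop on this element.
        owner = next((e for e in reversed(self.stack) if e is not None), None)
        for name in (attrs.get("itemprop") or "").split():
            # Short names are only meaningful relative to an item type.
            if owner == "" and ":" not in name:
                self.ungrounded.append(name)
        if tag not in VOID:
            if "itemscope" in attrs:
                self.stack.append(attrs.get("itemtype") or "")
            else:
                self.stack.append(None)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


def ungrounded_properties(html):
    """List short property names whose enclosing item has no itemtype."""
    checker = _ItemChecker()
    checker.feed(html)
    checker.close()
    return checker.ungrounded
```

Run against the markup above, this flags "title"; once itemtype="http://n.whatwg.org/work" is added to the div, nothing is flagged.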
> Furthermore, for this particular vocabulary the mapping to RDF is
> defined, as such:
> title: http://purl.org/dc/elements/1.1/title
> author: http://creativecommons.org/ns#attributionName
> license: http://www.w3.org/1999/xhtml/vocab#license
> In other words you express the exact same information as with RDFa but
> without the mental overhead of triples or mixing multiple
> vocabularies.
... and with the added danger of expressing ambiguous data. This is not
the real danger, though. While data ambiguity is really bad when it
comes to data stores, centralized vocabulary management is even worse.
RDFa is built on a concept called "follow your nose", which means that
all vocabulary term URLs in RDFa, such as
http://purl.org/media/audio#Recording, should be dereference-able and at
the end of that URL should be a machine-readable description of the
vocabulary term. Preferably, a human-readable description should also
exist at that URL.
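In code, "follow your nose" amounts to little more than an HTTP GET with a suitable Accept header. Here is a minimal Python sketch; the Accept value and the build_request() and dereference() helpers are my own choices for illustration, not something mandated by RDFa, and I make no claim about what any particular URL returns.

```python
import urllib.error
import urllib.request

# Assumed preference order: machine-readable RDF first, then HTML for humans.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9, text/html;q=0.5"


def build_request(term_url):
    """Build the GET request for the document that describes a vocabulary term."""
    # Fragment identifiers are never sent to the server, so drop them here.
    document_url = term_url.split("#", 1)[0]
    return urllib.request.Request(document_url, headers={"Accept": RDF_ACCEPT})


def dereference(term_url, timeout=10):
    """Return (HTTP status, Content-Type) for the term's description document."""
    try:
        with urllib.request.urlopen(build_request(term_url), timeout=timeout) as resp:
            return resp.status, resp.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; report them as data.
        return err.code, err.headers.get("Content-Type")
```

A consumer would call dereference() on each vocabulary term URL it encounters and treat anything other than a successful response with a parseable description as a term it cannot look up.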
Dereference http://n.whatwg.org/work and you get a 404 Error. Even
worse, the Microdata work vocabulary is hard-coded in the HTML5
specification. If you wanted to extend the vocabulary, you would have to
convince the sole editor of that specification, who has a track record
of being both very easy and very difficult to work with (based on
whether or not he agrees with you), that your vocabulary term warrants
addition.
There are currently 3 Microdata vocabularies in the spec[2].
By contrast, there are over 250 active RDF vocabularies[3].
That is the true power of decentralized vocabulary development, which is
a cornerstone of RDFa. The RDFa community understands that Wikipedia
should be in charge of choosing and extending vocabularies since this
community has the appropriate domain experts. You are the experts, we
are not - and it's important to recognize that in the design of any
semantic data expression language.
If Wikipedia agrees that embedding semantics in their pages is of worth
to humanity (and I certainly think it is of great worth), then there
will come a time that this community will want to develop their own
vocabulary. RDFa allows that vocabulary to be developed independently of
any standards body and allows this community to have full control of it.
Sure, you could make the argument that Microdata allows RDF to be
expressed (as long as you use the complete vocabulary URL), but at that
point the Microdata markup is far more cumbersome than the RDFa markup.
Similarly, if the goal is to express RDF, that is what RDFa was designed
to accomplish.
Philip, could you give us an update on what the WHATWG sees as the
publishing process for Microdata vocabularies? For example, if Wikipedia
wanted to start expressing royal bloodlines using a vocabulary specific
to Wikipedia, how would they go about getting that vocabulary into the
HTML5 Microdata specification?
> Certainly, but if wiki editors are *able* to do it by hand, then IMHO
> microdata is much less error-prone.
IMHO, there are ways to shoot yourself in the foot with both Microdata
and RDFa - as I've outlined above. I suppose that you could use both and
pick which foot you're going to shoot with which technology :), but my
suggestion is that nobody should be making such generalized statements -
that one is more error-prone than the other.
It's like saying that programming in Python is more error-prone than
programming in PHP - it depends entirely on the skill of the developer,
what you're doing, and many other factors that are out of the hands of
language designers.
>> However - XHTML1+RDFa is a published W3C Recommendation and it is safe
> Is Wikipedia using XHTML served as application/xhtml+xml? It seems
> that RDFa in "XHTML" as deployed only works because consumers pretend
> that the data is XHTML even though it is served as text/html and
> treated as such by browsers. I would assume that most pages using RDFa
> today are neither valid XHTML, nor served with the XHTML MIME type.
> Any attempts to use browser DOM APIs to access the data will have
> surprising/confusing results, as HTML doesn't have namespaces but RDFa
> uses the syntax.
Frankly, this is something that nobody who uses this technology cares
about because all they are ever going to see are key-value pairs
(Microdata) or triples (RDFa).
This is something that only concerns browser manufacturers and RDFa
parser writers. That's why there is a Microdata API, and why there will be
an RDFa API. There are also many RDFa parser implementations to
abstract this low-level stuff away.
Both Microdata and RDFa are being designed to operate in "dirty"
environments with invalid markup and will work regardless of the MIME
type, file extension, markup botching and namespace support across
websites and web browsers.
There are a number of RDFa Javascript implementations that work just[4]
fine[5] on badly authored/served XHTML documents.
Besides, the Wikipedia community has done a fantastic job of generating
valid XHTML:
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&…
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_Nat…
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shun…
The migration to XHTML+RDFa would only require the DOCTYPE to change...
which shouldn't be any more difficult than transitioning to HTML5 (or
HTML5+RDFa) in the future.
> Finally I will note that it is very likely that the microdata DOM APIs
> will get implemented in browsers, making the semantic data available
> to both scrapers, to native browser interfaces and to browser
> extensions such as user JavaScript. As an example, you might see an
> icon in the address bar for saving events to a calendar, or the
> license information of an image displayed in the native properties
> dialog. I stress again that I don't make any promises on behalf of
> Opera or any other browser vendor, these are just my predictions.
Again, this is exciting news and while I don't think Microdata is the
proper solution for the Web, for the same reasons that are outlined
above and many more, I'm delighted to hear that Opera is taking
in-browser semantic data expression very seriously. How far we have come
in just 18 months! :)
-- manu
[1] http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-August/015971.ht…
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.htm…
[3] http://prefix.cc/popular/all
[4] http://code.google.com/p/rdfquery/
[5] http://code.google.com/p/ubiquity-rdfa/
--
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: Monarch - Next Generation REST Web Services
http://blog.digitalbazaar.com/2009/12/14/monarch/