Before we get into this thread too deeply, for those who are not familiar with semantic data, RDF, RDFa, or why any of this stuff applies to Wikipedia, there are two very short videos that explain the concepts at a high level (apologies, as they're a bit dated):
Intro to the Semantic Web (6 minutes) http://www.youtube.com/watch?v=OGg8A2zfWKg
RDFa Basics (9 minutes) http://www.youtube.com/watch?v=ldl0m-5zLz4
Aryeh Gregor wrote:
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use.
Not necessarily. Javascript can use the RDFa on a page to generate more intuitive interfaces for it. To give an example - we use the RDFa expressed in our music pages:
http://bitmunk.com/media/6995806
to drive the music player application via Javascript - by parsing the RDFa and feeding the sample URLs to the player.
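A rough sketch of the pattern (the vocabulary and property names below are placeholders, not the exact markup on our pages):

  <div xmlns:ex="http://example.org/music#"
       about="/media/6995806" typeof="ex:Track">
    <span property="ex:title">Some Song Title</span>
    <a rel="ex:sample" href="/media/6995806/sample.mp3">Play a sample</a>
  </div>

A few lines of Javascript can walk the DOM for rel="ex:sample", collect the @href values, and hand them to the player - no second data feed required.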
To give a less-than-ideal example - Wikipedia could use the data on a page to provide interactive discovery of the concepts it expresses (such as automatically fetching and parsing the RDFa on a related page to display more factual information on the current one). The gist of what I'm getting at is: don't dismiss the value of having a standardized mechanism for embedding data in the page - you get to use it both internally and externally. The more data you expose, the greater the possibility of somebody figuring out how to use it in amazing new ways.
Aryeh Gregor wrote:
I'll emphasize from the start that I do *not* think either RDFa or microdata is suitable for dbpedia.org-style content. There's no reason we should put that in the HTML output, where it will take up tons of space and not be useful to HTML consumers (e.g., browsers and search engines).
Placing this data in your HTML documents has a direct impact on browsers and search engines. Browsers can collect triples and use them later to help you answer questions that you may have about a particular subject. Search engines can crawl the HTML and make their indexes more accurate based on semantic data that Wikipedia's pages expose.
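As a hypothetical example (I'm using DBpedia ontology URIs purely for illustration), markup like this on the page about Germany:

  <p xmlns:dbp="http://dbpedia.org/ontology/"
     about="http://dbpedia.org/resource/Germany">
    The capital is
    <span rel="dbp:capital"
          resource="http://dbpedia.org/resource/Berlin">Berlin</span>.
  </p>

exposes the triple <Germany> dbp:capital <Berlin> directly in the HTML, so a browser extension or a search engine crawler can index the fact itself instead of trying to re-derive it from prose.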
RDF/XML, which was largely unsuccessful, was designed for publishing in a dual-stream setup. The expectation was that web publishers would publish semantic data alongside their web page data, just as you've proposed Wikipedia do, but managing both types of serialization proved far too difficult for most sites.
Wikipedia is already short on developers; creating a new data stream is just going to exacerbate the problem. Besides, the way Wikipedia seems to capture data is via wikitext, not direct database entries. In effect, this community's database exists in the wikitext.
Aryeh Gregor wrote:
On the other other other hand, RDFa 1.1 is under development and looks like it will make major changes, so from that perspective microdata is arguably more stable.
There are new features going into RDFa 1.1, but classifying them as "major" changes makes it sound like RDFa 1.1 isn't going to be backwards-compatible with RDFa 1.0, when it most definitely is (except possibly for XMLLiterals, which was our bad).
The claim that Microdata is more stable because there are new features going into RDFa 1.1 is illogical. By analogy: just because there are new features going into the next version of Apache doesn't mean it's any less "stable" for those who are using the current version today.
Aryeh Gregor wrote:
So, it's complicated. :) But from our perspective, I don't think there's a big difference in terms of stability or standard-ness, so I skipped over all this.
There's a huge difference in both stability and standard-ness: XHTML+RDFa is a W3C Recommendation (REC) - it's a standard. Microdata and HTML+RDFa aren't even close to becoming W3C Recommendations. That's very important information for this community to consider.
When do you think that Microdata is going to be a REC at the W3C?
Ian Hickson made changes to the Microdata spec less than 12 hours ago (January 18th, 2010). If a spec is being actively edited, I don't think it's a good idea to call it stable and ready for deployment:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-January/024760.html
You are skipping over some pretty important stuff, kemosabe. :)
Aryeh Gregor wrote:
so converting the microdata graph to RDFa might be easier than the reverse.
Microdata's underlying model is triples as well - Microdata allows a limited expression of RDF. Since RDFa supports the expression of RDF more fully, you can map Microdata to RDFa more easily than you can map RDFa to Microdata (for some value of "easier").
You cannot, however, express RDF fully in Microdata - and some of the things it cannot express matter to Wikipedia (like data typing).
Microdata doesn't support data typing (RDFa's @datatype), data value overriding (@content), URI short-handing via CURIEs (@xmlns:PREFIX), or anonymous subjects via bnodes (blank nodes). The missing @datatype, @content and CURIE support affects Wikipedia; the lack of bnodes doesn't necessarily impact the Wikipedia community, AFAICT.
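To make that concrete, here is a hypothetical RDFa snippet (the property name and the number are illustrative) that uses all three of the features that matter:

  <div xmlns:dbp="http://dbpedia.org/ontology/"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
       about="#subject">
    Population:
    <span property="dbp:populationTotal"
          content="8174100" datatype="xsd:integer">about 8.2 million</span>
  </div>

The CURIE keeps the property name short, @content carries the machine-readable value behind the human-readable text, and @datatype marks the value as an integer. Microdata's @itemprop would only give a consumer the display string "about 8.2 million".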
Aryeh Gregor wrote:
I also think microdata is much easier to author for people with an HTML (not RDF) background -- template editors tend to have a good working knowledge of HTML, but not web-data technologies. I'd be interested in what Manu (or other RDFa supporters) has to say here.
I do think that Microdata has that going for it - attribute names such as @itemref and @itemprop are easier to understand than @about, @datatype, @rel/@rev, and @content.
I'm all for making it easier for web authors to write this stuff, so the consistency of the itemXYZ attributes in Microdata was a good move. We chose not to do that for RDFa because we wanted to make the mapping from HTML to RDF explicit. The downside is that authors either need their RDFa autogenerated for them (which is the best approach for both RDFa and Microdata) or need to sit through a 10-minute tutorial on RDF (like the videos at the top of this e-mail).
I also think that Microdata repeats several really big mistakes that we made in the Microformats community and corrected in the RDFa community - namely, not using CURIEs, and requiring that a full URL be repeated every time it's used. That's fine as an option, but not so great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it uses Microdata.
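To illustrate the repetition problem (again using dbpedia.org/ontology/ URIs purely as an illustration) - in RDFa the full URI is written once, as a prefix:

  <body xmlns:dbp="http://dbpedia.org/ontology/">
    ...
    <span property="dbp:birthDate">...</span>
    <span property="dbp:birthPlace">...</span>
    <span property="dbp:deathDate">...</span>

while in Microdata each property that isn't from a predefined vocabulary repeats the full URL:

  <span itemprop="http://dbpedia.org/ontology/birthDate">...</span>
  <span itemprop="http://dbpedia.org/ontology/birthPlace">...</span>
  <span itemprop="http://dbpedia.org/ontology/deathDate">...</span>

Multiply that by the number of facts in a large, infobox-heavy article and the extra page weight adds up quickly.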
http://rdfa.info/wiki/Developer-faq#Authoring
The FAQ above, which is a work in progress, is a good introduction to some of the common criticisms of RDFa and the reasoning behind its design decisions, for those who are interested.
The FAQ also addresses the fallacy that RDFa markup is, for real-world data, more verbose than Microdata markup.
Aryeh Gregor wrote:
Neither has more built-in validation than the other. Both allow arbitrary validation. RDFa seems to allow validation to be encoded in a more machine-readable format, but whether that's an advantage at all is debatable.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. To validate Microdata, you must first convert it to RDF, and even then, literals that should carry a datatype will fail validation. If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
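For instance, a hypothetical RDFa snippet like this (the subject and date are made up):

  <div xmlns:dbp="http://dbpedia.org/ontology/"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
       about="#someone">
    Born:
    <span property="dbp:birthDate"
          content="1955-06-08" datatype="xsd:date">8 June 1955</span>
  </div>

produces the typed literal "1955-06-08"^^xsd:date, which any generic RDF tool can check against the xsd:date lexical rules without knowing anything about the vocabulary. The Microdata equivalent yields only the untyped string "8 June 1955", so the "this should be a date" rule has to be baked into a separate validator for each vocabulary.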
-- manu
[1]http://krijnhoetmer.nl/irc-logs/whatwg/20100118#l-219