Before we get into this thread too deeply, for those who are not familiar with semantic data, RDF, RDFa, or why any of this stuff applies to Wikipedia, there are two very short videos that explain the concepts at a high level (apologies, as they're a bit dated):
Intro to the Semantic Web (6 minutes) http://www.youtube.com/watch?v=OGg8A2zfWKg
RDFa Basics (9 minutes) http://www.youtube.com/watch?v=ldl0m-5zLz4
Aryeh Gregor wrote:
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use.
Not necessarily. Javascript can use the RDFa on a page to generate more intuitive interfaces for it. To give an example - we use the RDFa expressed in our music pages:
http://bitmunk.com/media/6995806
to drive the music player application via Javascript - by parsing the RDFa and feeding the sample URLs to the player.
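A rough sketch of the pattern (the vocabulary and property names below are placeholders, not the exact markup on our pages):

  <div xmlns:ex="http://example.org/music#"
       about="/media/6995806" typeof="ex:Track">
    <span property="ex:title">Some Song Title</span>
    <a rel="ex:sample" href="/media/6995806/sample.mp3">Play a sample</a>
  </div>

A few lines of Javascript can walk the DOM for rel="ex:sample", collect the @href values, and hand them to the player - no second data feed required.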
To give a less-than-ideal example - Wikipedia could use the data on a page to provide interactive discovery of the concepts it expresses (such as automatically fetching and parsing the RDFa on a related page to display more factual information on the current one). The gist of what I'm getting at is: don't dismiss the value of having a standardized mechanism for embedding data in the page - you get to use it both internally and externally. The more data you expose, the greater the possibility of somebody figuring out how to use it in amazing new ways.
Aryeh Gregor wrote:
I'll emphasize from the start that I do *not* think either RDFa or microdata is suitable for dbpedia.org-style content. There's no reason we should put that in the HTML output, where it will take up tons of space and not be useful to HTML consumers (e.g., browsers and search engines).
Placing this data in your HTML documents has a direct impact on browsers and search engines. Browsers can collect triples and use them later to help you answer questions that you may have about a particular subject. Search engines can crawl the HTML and make their indexes more accurate based on semantic data that Wikipedia's pages expose.
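As a hypothetical example (I'm using DBpedia ontology URIs purely for illustration), markup like this on the page about Germany:

  <p xmlns:dbp="http://dbpedia.org/ontology/"
     about="http://dbpedia.org/resource/Germany">
    The capital is
    <span rel="dbp:capital"
          resource="http://dbpedia.org/resource/Berlin">Berlin</span>.
  </p>

exposes the triple <Germany> dbp:capital <Berlin> directly in the HTML, so a browser extension or a search engine crawler can index the fact itself instead of trying to re-derive it from prose.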
RDF/XML, which was largely unsuccessful, was designed for publishing in a dual-stream setup. The expectation was that web publishers would publish semantic data alongside their web page data, just as you've proposed Wikipedia do, but managing both types of serialization proved far too difficult for most sites.
Wikipedia is already short on developers; creating a new data stream is just going to exacerbate the problem. Besides, the way Wikipedia seems to capture data is via wikitext, not direct database entries. In effect, this community's database exists in the wikitext.
Aryeh Gregor wrote:
On the other other other hand, RDFa 1.1 is under development and looks like it will make major changes, so from that perspective microdata is arguably more stable.
There are new features going into RDFa 1.1, but classifying them as "major" changes makes it sound like RDFa 1.1 isn't going to be backwards-compatible with RDFa 1.0, when it most definitely is (except possibly for XMLLiterals, which was our bad).
The claim that Microdata is more stable because there are new features going into RDFa 1.1 is illogical. By analogy: just because there are new features going into the next version of Apache doesn't mean it's any less "stable" for those who are using the current version today.
Aryeh Gregor wrote:
So, it's complicated. :) But from our perspective, I don't think there's a big difference in terms of stability or standard-ness, so I skipped over all this.
There's a huge difference in both stability and standard-ness: XHTML+RDFa is a W3C Recommendation (REC) - it's a standard. Microdata and HTML+RDFa aren't even close to becoming W3C Recommendations. That's very important information for this community to consider.
When do you think that Microdata is going to be a REC at the W3C?
Ian Hickson made changes to the Microdata spec less than 12 hours ago (January 18th, 2010). If a spec is being actively edited, I don't think it's a good idea to call it stable and ready for deployment:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-January/024760.html
You are skipping over some pretty important stuff, kemosabe. :)
Aryeh Gregor wrote:
so converting the microdata graph to RDFa might be easier than the reverse.
Microdata's underlying model is triples as well - Microdata allows a limited expression of RDF. Since RDFa supports the expression of RDF more fully, you can map Microdata to RDFa more easily than you can map RDFa to Microdata (for some value of "easier").
You cannot, however, express RDF fully in Microdata - and some of the things it cannot express matter to Wikipedia (like data typing).
Microdata doesn't support data typing (RDFa's @datatype), data value overriding (@content), URI short-handing via CURIEs (@xmlns:PREFIX), or anonymous subjects via bnodes (blank nodes). The missing @datatype, @content and CURIE support affects Wikipedia; the lack of bnodes doesn't necessarily impact the Wikipedia community, AFAICT.
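To make that concrete, here is a hypothetical RDFa snippet (the property name and the number are illustrative) that uses all three of the features that matter:

  <div xmlns:dbp="http://dbpedia.org/ontology/"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
       about="#subject">
    Population:
    <span property="dbp:populationTotal"
          content="8174100" datatype="xsd:integer">about 8.2 million</span>
  </div>

The CURIE keeps the property name short, @content carries the machine-readable value behind the human-readable text, and @datatype marks the value as an integer. Microdata's @itemprop would only give a consumer the display string "about 8.2 million".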
Aryeh Gregor wrote:
I also think microdata is much easier to author for people with an HTML (not RDF) background -- template editors tend to have a good working knowledge of HTML, but not web-data technologies. I'd be interested in what Manu (or other RDFa supporters) has to say here.
I do think that Microdata has that going for it - attribute names such as @itemref and @itemprop are easier to understand than @about, @datatype, @rel/@rev, and @content.
I'm all for making it easier for web authors to write this stuff, so the consistency of the itemXYZ attributes in Microdata was a good move. We chose not to do that for RDFa because we wanted to make the mapping from HTML to RDF explicit. The downside is that authors either need their RDFa autogenerated for them (which is the best approach for both RDFa and Microdata) or need to sit through a 10-minute tutorial on RDF (like the videos at the top of this e-mail).
I also think that Microdata repeats several really big mistakes that we made in the Microformats community and corrected in the RDFa community - namely, not using CURIEs, and requiring that a full URL be repeated every time it's used. That's fine as an option, but not so great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it uses Microdata.
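To illustrate the repetition problem (again using dbpedia.org/ontology/ URIs purely as an illustration) - in RDFa the full URI is written once, as a prefix:

  <body xmlns:dbp="http://dbpedia.org/ontology/">
    ...
    <span property="dbp:birthDate">...</span>
    <span property="dbp:birthPlace">...</span>
    <span property="dbp:deathDate">...</span>

while in Microdata each property that isn't from a predefined vocabulary repeats the full URL:

  <span itemprop="http://dbpedia.org/ontology/birthDate">...</span>
  <span itemprop="http://dbpedia.org/ontology/birthPlace">...</span>
  <span itemprop="http://dbpedia.org/ontology/deathDate">...</span>

Multiply that by the number of facts in a large, infobox-heavy article and the extra page weight adds up quickly.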
http://rdfa.info/wiki/Developer-faq#Authoring
The FAQ above, which is a work in progress, is a good introduction to some of the common criticisms of RDFa and the reasoning behind its design decisions, for those who are interested.
The FAQ also addresses the fallacy that RDFa markup is, for real-world data, more verbose than Microdata markup.
Aryeh Gregor wrote:
Neither has more built-in validation than the other. Both allow arbitrary validation. RDFa seems to allow validation to be encoded in a more machine-readable format, but whether that's an advantage at all is debatable.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. To validate Microdata, you must first convert it to RDF, and even then, literals that should carry a datatype will fail validation. If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
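For instance, a hypothetical RDFa snippet like this (the subject and date are made up):

  <div xmlns:dbp="http://dbpedia.org/ontology/"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
       about="#someone">
    Born:
    <span property="dbp:birthDate"
          content="1955-06-08" datatype="xsd:date">8 June 1955</span>
  </div>

produces the typed literal "1955-06-08"^^xsd:date, which any generic RDF tool can check against the xsd:date lexical rules without knowing anything about the vocabulary. The Microdata equivalent yields only the untyped string "8 June 1955", so the "this should be a date" rule has to be baked into a separate validator for each vocabulary.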
-- manu
[1]http://krijnhoetmer.nl/irc-logs/whatwg/20100118#l-219