On Mon, Jan 18, 2010 at 5:34 PM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> Not necessarily. JavaScript can use the RDFa on the page to generate more intuitive interfaces for the page.
Sure, but if we're providing the JavaScript, we could do it without RDFa just as well. Or can you provide a specific case where you think it would be easier for MediaWiki to implement some feature via RDFa (or microdata) than via any other means, not counting communication with outside software? Such cases might exist (like if there's a library to do it that already happens to use RDFa), but they'd be hard to find and debatable at best, I suspect.
> Placing this data in your HTML documents has a direct impact on browsers and search engines. Browsers can collect triples and use them later to help you answer questions that you may have about a particular subject. Search engines can crawl the HTML and make their indexes more accurate based on semantic data that Wikipedia's pages expose.
*Can*. Yes, in theory. But do they? Will they? If not, then it's probably not worth putting much work into it so speculatively, especially if it increases the complexity of editing. On the other hand, if they do implement feature X when you provide in-page metadata, would they be equally willing to use a separate RDF stream?
> RDF/XML, which was largely unsuccessful, was designed to be used for publishing in a dual-stream setup. It was expected that web publishers would publish semantic data beside web page data, just as you've proposed that Wikipedia do, but managing both types of serialization proved far too difficult for most sites.
Is it managing two serializations that was the problem? Or just that most sites aren't willing to encode data in the hope that some consumer somewhere might use it for something in the future? Personally, I don't think it would be hard at all to maintain multiple data streams. The content is all script-generated anyway. We already have multiple ways to access the same data or subsets thereof in various formats, like:
http://en.wikipedia.org/wiki/RDFa
http://en.wikipedia.org/wiki/RDFa?action=raw
http://en.wikipedia.org/w/api.php?action=query&prop=categories&title...
http://en.wikipedia.org/w/api.php?action=query&prop=extlinks&titles=...
http://en.wikipedia.org/w/api.php?action=query&prop=templates&titles=...
and many others. You can append &format=xml to the API queries to get them in proper XML, or &format=json for JSON, php for PHP array syntax, yaml for YAML, txt for plaintext, etc. It would be pretty simple to write a new API module or query prop or whatever that would retrieve any type of data from the wikitext of the page and format it as RDF or whatever else you liked.
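For example, a hypothetical prop module (prop=rdf is a name I'm inventing here; no such module exists today) could be queried the same way as the ones above:

  http://en.wikipedia.org/w/api.php?action=query&prop=rdf&titles=RDFa&format=xml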
> Wikipedia is already short on developers; creating a new data stream is just going to exacerbate the problem.
No, it would be pretty simple, in my opinion as a MediaWiki developer.
> There are new features going into RDFa 1.1, but classifying them as "major" changes makes it sound like RDFa 1.1 isn't going to be backwards-compatible with RDFa 1.0, when it most definitely is going to be backwards-compatible (except possibly for XMLLiterals, which was our bad).
I apologize if I inadvertently misrepresented the status of RDFa 1.1. I'm not familiar with RDFa, as I said.
> There's a huge difference in both stability and standard-ness - XHTML+RDFa is a W3C REC - it's a standard. Microdata and HTML+RDFa aren't even close to becoming a W3C REC. That's very important information for this community to consider.
> When do you think that Microdata is going to be a REC at the W3C?
I don't really care about formal status at the W3C. I care about providing useful features to users of Wikipedia and other MediaWiki wikis. Both RDFa and microdata are stable and usable enough right now that I think it's appropriate to evaluate them on their technical merits, not their theoretical spec status. We use plenty of things that aren't specified by any conventional standards body, like rel="canonical", OpenSearch, RSS, and so on. As long as they're well-specified de facto standards, it doesn't really matter who specifies them or what that group labels them -- why should it?
> There were changes to the Microdata spec made by Ian less than 12 hours ago (January 18th 2010). If a spec is being actively edited, I don't think it's a good idea to say that it's stable and ready for deployment:
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-January/024760.html
I don't see why not, as long as the editor is committed to avoiding backward-incompatible changes if possible. In the unlikely event something major comes up and there is such a change, it's not the end of the world -- we can deal with it when it comes up.
> Microdata doesn't support data typing (via @datatype),
More precisely, it leaves it up to each vocabulary to determine how to handle data typing.
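For comparison, typed data in RDFa 1.0 looks roughly like this (assuming the dc: and xsd: prefixes are declared via xmlns: on some ancestor element):

  <span property="dc:date" datatype="xsd:date">2010-01-18</span>

A microdata consumer would instead have to learn from the vocabulary's own definition that the value is a date.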
> data value overriding (via @content),
<meta itemprop="foo" content="bar">?
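That seems to cover the same ground: where RDFa would override the visible text with something like <span property="dc:date" content="2010-01-18">January 18</span>, microdata moves the machine-readable value into a <meta> (or a <link>, for URLs) inside the item. Same capability, different mechanism.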
> doesn't support URI short-handing via CURIEs (via @xmlns:PREFIX),
It doesn't require URIs to be used for anything except one itemtype per item, so this isn't a big deal if you only have a few items of any given type per page (which would usually be the case for, e.g., image licenses).
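An image description page might need no more than this (the itemtype URL is invented, just for illustration):

  <div itemscope itemtype="http://example.org/vocab#Image">
    <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a>
  </div>

One full URI for the item type; the property name is a plain word, and the license URL is a link we'd be displaying anyway.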
> and it doesn't support anonymous subjects via bnodes (blank nodes).
I'm not sure what this even means. :)
> I do also think that Microdata repeats several really big mistakes that we made in the Microformats community and that were corrected in the RDFa community. Namely, not using CURIEs, and requiring that all URLs be repeated as many times as they're used. It's fine as an option, but not that great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it is using Microdata.
Not if we only use it for a few things, like image licenses. Those are only displayed on the image description page, so it would be once per page in that case. I don't propose we use it for anything where we'd have fifty items per page.
RDFa seems longer even if you don't count the xmlns: stuff, anyway. Above, I found that a microdata example added 145 characters to the base markup, while equivalent RDFa (with xmlns:) added 305 characters. If you remove the two xmlns: declarations, I count only 86 characters saved, so RDFa still adds 219 characters, 50% more than microdata. So dropping the xmlns: declarations saves some space at best, but microdata is still significantly shorter than RDFa, at least for this example.
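To make that concrete, here's a rough side-by-side sketch (invented example.org vocabulary, not the exact example from earlier in the thread):

Microdata:

  <div itemscope itemtype="http://example.org/vocab#Photo">
    <span itemprop="title">Rotunda</span>
    <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA</a>
  </div>

RDFa 1.0:

  <div xmlns:ex="http://example.org/vocab#" typeof="ex:Photo">
    <span property="ex:title">Rotunda</span>
    <a rel="ex:license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA</a>
  </div>

The structure is the same, but RDFa pays for the xmlns: declaration plus a prefix on every property, while microdata pays for exactly one full URI per item.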
> That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. In order to validate Microdata, you must first convert it to RDF, and even if you do, it will fail attempts to validate the literals that should have a datatype. If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
> RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
I think you agreed with what I said. Both microdata and RDFa allow validation. RDFa allows some validation constraints to be expressed in a standard form, so they can be checked by generic RDFa validators. Microdata does not.
But it's not clear to me that this is a disadvantage in practice. Presumably anything that actually uses the data will have to be smart enough to discard invalid data anyway, at no extra cost, so why not just do it at that stage? Or, if you're using a very small set of vocabularies as I propose MediaWiki does, you can assume that validators will exist for them anyway.