Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
This is about as long as before, but it might still be wrong. The general points I made are still accurate, anyway.
The general points that you made were riddled with technical inaccuracies and bad advice; if implemented by the MediaWiki community, they would have resulted in semantic data that would have been ambiguous at best and erroneous at worst. I don't know if you intended the tone of your e-mail in the way that I read it, but it came off as purposefully misleading based on the discussions that both you and I have had as members of the HTMLWG and WHATWG. I'll address the technical and factual errors that I believe have been made in your posts, as well as provide alternative guidance.
Just to briefly introduce myself to this community, I do standards work in a variety of online communities including the Microformats community (lead editor for hAudio, hMedia and hVideo), contract my expertise to the music industry and I am also an Invited Expert to the W3C's Semantic Web Deployment Working Group and co-chair of the upcoming RDFa Working Group and editor of the HTML5+RDFa spec. The company I founded is interested in expressing digital content online via semantic languages and builds open source software for the creation and standardization of copyright-aware, DRM-free, peer-to-peer networks.
For guidance on how to implement semantic markup in a CMS, we might want to look at the Drupal Community, who have done a superb job of integrating RDFa into their platform. They expect several hundred thousand websites to start using RDFa within the next year or two.
One lesson that we learned during the implementation of RDFa in Drupal is that it is helpful for CMS designers to pre-define vocabularies that are usable with their CMS systems if manual markup is necessary. Most Microdata and RDFa markup should also be left to the CMS code unless there is a very good reason not to do so.
If you want to allow manual markup of RDFa, MediaWiki should probably pre-define at least Dublin Core (used to describe creative works), FOAF (used to describe people and organizations), and Creative Commons (used to describe licenses). There are many RDF vocabularies to choose from and Wikipedia might consider creating a few of their own. Pre-defining vocabularies would greatly simplify the markup in case someone wanted to mark up something by hand.
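To make that concrete, here is a sketch of what pre-defining vocabularies could look like in XHTML+RDFa 1.0: MediaWiki would declare the CURIE prefixes once on the root element, so that editors and templates could use dc:, foaf: and cc: terms without any per-page setup (the exact prefix set is, of course, a design decision):

```html
<!-- A sketch, not MediaWiki's actual output: CURIE prefixes that could
     be pre-defined site-wide. dc = Dublin Core (creative works),
     dctype = DCMI types, foaf = people/organizations,
     cc = Creative Commons (licenses). -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:dctype="http://purl.org/dc/dcmitype/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"
      xmlns:cc="http://creativecommons.org/ns#">
  ...
</html>
```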
Let's revisit Aryeh's example:
Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
The above could be marked up in RDFa, with pre-defined vocabs, like so:
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
   typeof="dctype:StillImage">
  <span property="dc:title">Emery Molyneux Terrestrial Globe</span> by
  <a rel="cc:attributionUrl" href="http://example.org/bob/"
     property="cc:attributionName">Bob Smith</a> is licensed under a
  <a rel="license"
     href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
     Commons Attribution-Share Alike 3.0 United States License</a>.
</p>
This would produce the following triples (I haven't expanded the CURIEs out, in order to make it easier to read):
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> rdf:type dctype:StillImage .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> dc:title "Emery Molyneux Terrestrial Globe" .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> cc:attributionName "Bob Smith" .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> xhv:license <http://creativecommons.org/licenses/by-sa/3.0/us/> .
So, four pieces of data, which is pretty good considering the compactness of the HTML code. The Microdata looks like this:
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
  ...
  <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
       width="640" height="480" itemprop="work">
  ...
  <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by
  <span itemprop="author">Bob Smith</span> is licensed under a
  <a itemprop="license"
     href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
     Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example. There are some things that are easier to express in Microdata and there are some things that are easier to express in RDFa. We get the following Microdata out:
type    http://n.whatwg.org/work
work    http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg
title   "Emery Molyneux Terrestrial Globe"
author  "Bob Smith"
license http://creativecommons.org/licenses/by-sa/3.0/us/
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
Concern #1:
Ambiguity is a big problem when it comes to semantics - make sure that, if this community does use Microdata markup, you fully qualify terms. It is far easier to be ambiguous in Microdata than it is in RDFa. So, instead of using itemprop="title" you should be using itemprop="http://purl.org/dc/terms/title" - which will inflate the markup required for Microdata, but is necessary when it comes to classifying this information accurately for semantic data processors (such as via SPARQL or higher-level reasoning agents).
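To illustrate, the fully-qualified variant of the earlier markup would look like this (a sketch; the URL is the Dublin Core title term mentioned above):

```html
<!-- Verbose, but unambiguous for any semantic data processor -->
<span itemprop="http://purl.org/dc/terms/title">Emery Molyneux
Terrestrial Globe</span>
```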
Concern #2:
Getting Microdata and RDFa markup correct is easier if there are templates or if the semantic markup is performed automatically by the CMS based on a pre-defined form. For example, http://en.wikipedia.org/wiki/Augustus, note the Infobox on the right. It would be much better for the RDFa markup to happen automatically via MediaWiki's template process, than for it to be marked up by hand.
Concern #3:
Intentional or not, Aryeh has painted RDFa in a negative light by not outlining a number of points related to adoption and both RDFa's and Microdata's current status in the HTML Working Group. Adopting either RDFa or Microdata in an HTML5 document type would be premature at this time because neither has progressed past the Editor's Draft stage yet. Either is subject to change as far as HTML5 is concerned, and we really don't want you to ship HTML5 features before they've had a chance to solidify a bit more.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use for deployment. Google[1] is actively indexing RDFa today, as is Yahoo[2]. Digg, Whitehouse.gov, the UK Government, The Public Library of Science and O'Reilly are high-profile sites that publish their pages using RDFa. Data formats such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of their language. Best Buy saw a 30% traffic increase after publishing their pages in RDFa using the GoodRelations vocabulary. I'm sure everyone here is aware of dbpedia.org[3] and Freebase[4], which use RDF as a semantic representation format. dbpedia, which gets its data from Wikipedia, shows 479 million triples available - so that should give you folks some idea of the treasure trove of immediately extractable semantic data we're talking about.
Make no mistake - RDFa has very strong deployment at this point and it will continue to grow past 100,000+ sites with the upcoming release of Drupal 7.
Concern #4:
While I can't fault Aryeh's enthusiasm, I am now concerned that there may be questions in this community that are going unanswered related to RDFa and Microdata. I hope this will be a deliberate process as it is easy to get semantic data markup wrong (regardless of the implementation language - Microformats, Microdata or RDFa).
I hope that those that have an interest in semantic data will discuss concerns and ask us about the lessons we've learned when implementing metadata markup. The best place to send RDFa development questions at the moment is:
http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/
We have a very friendly community that would love to answer any questions that this community may have related to semantic data markup. Please do respond to me directly or in this thread if you have lingering concerns or questions - either the RDFa community or I will do our best to answer any questions.
-- manu
[1]http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets... [2]http://developer.yahoo.net/blog/archives/2008/09/searchmonkey_support_for_rd... [3]http://en.wikipedia.org/wiki/DBpedia#Example [4]http://en.wikipedia.org/wiki/Freebase_(database)
I don't suppose that the members of this list appreciate the epic Microdata vs. RDFa battle leaking into this mailing list, but I want to address a few inaccuracies below.
Introduction: I work for Opera Software and have been active in the WHATWG and W3C HTML WG developing HTML5 for the last year and a half. I believe I have a good understanding of what browser vendors are likely and not likely to support, although I don't speak for or make any promises on behalf of Opera Software in this mail.
I have also worked on implementing the microdata DOM API in JavaScript, an ongoing experiment at http://gitorious.org/microdatajs, and I will be able to answer any technical questions about the processing of microdata. In short, I can only say that it is really quite intuitive and simple, with few surprises. It maps well to the RDF model if you want it, but doesn't force authors to think in terms of subject, predicate, object triples.
On Sat, Jan 16, 2010 at 06:32, Manu Sporny msporny@digitalbazaar.com wrote:
Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
[snip]
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example. There are some things that are easier to express in Microdata and there are some things that are easier to express in RDFa. We get the following Microdata out:
type    http://n.whatwg.org/work
work    http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg
title   "Emery Molyneux Terrestrial Globe"
author  "Bob Smith"
license http://creativecommons.org/licenses/by-sa/3.0/us/
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
No, as long as an item type is used (http://n.whatwg.org/work) there is no ambiguity. This particular item type is defined at http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#l...
The "title" property here "Gives the name of the work." There is no ambiguity.
Furthermore, for this particular vocabulary the mapping to RDF is defined, as follows:

title:   http://purl.org/dc/elements/1.1/title
author:  http://creativecommons.org/ns#attributionName
license: http://www.w3.org/1999/xhtml/vocab#license
In other words, you express the exact same information as with RDFa, but without the mental overhead of triples or mixing multiple vocabularies.
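Applying that mapping to the example, a microdata-to-RDF converter would emit essentially the same triples as the RDFa version. A sketch in Turtle (the blank-node subject is illustrative; exact subject selection depends on the converter):

```turtle
# Triples a converter could derive from the microdata example above
_:work <http://purl.org/dc/elements/1.1/title>
           "Emery Molyneux Terrestrial Globe" ;
       <http://creativecommons.org/ns#attributionName> "Bob Smith" ;
       <http://www.w3.org/1999/xhtml/vocab#license>
           <http://creativecommons.org/licenses/by-sa/3.0/us/> .
```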
Concern #2:
Getting Microdata and RDFa markup correct is easier if there are templates or if the semantic markup is performed automatically by the CMS based on a pre-defined form. For example, http://en.wikipedia.org/wiki/Augustus, note the Infobox on the right. It would be much better for the RDFa markup to happen automatically via MediaWiki's template process, than for it to be marked up by hand.
Certainly, but if wiki editors are *able* to do it by hand, then IMHO microdata is much less error-prone.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe
Is Wikipedia using XHTML served as application/xhtml+xml? It seems that RDFa in "XHTML" as deployed only works because consumers pretend that the data is XHTML even though it is served as text/html and treated as such by browsers. I would assume that most pages using RDFa today are neither valid XHTML nor served with the XHTML MIME type. Any attempts to use browser DOM APIs to access the data will have surprising/confusing results, as HTML doesn't have namespaces but RDFa uses namespace syntax.
Concern #4:
While I can't fault Aryeh's enthusiasm, I am now concerned that there may be questions in this community that are going unanswered related to RDFa and Microdata. I hope this will be a deliberate process as it is easy to get semantic data markup wrong (regardless of the implementation language - Microformats, Microdata or RDFa).
Agreed.
The microdata spec for the curious: http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html
Finally, I will note that it is very likely that the microdata DOM APIs will get implemented in browsers, making the semantic data available to scrapers, native browser interfaces and browser extensions such as user JavaScript. As an example, you might see an icon in the address bar for saving events to a calendar, or the license information of an image displayed in the native properties dialog. I stress again that I don't make any promises on behalf of Opera or any other browser vendor; these are just my predictions.
In other goodies, microdata already has a defined mapping to JSON, so dumping all embedded data as JSON via a web interface would be quite trivial, using the same format that you will get from browsers once they have implemented some of the DOM APIs.
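As a sketch of that JSON serialization (illustrative, not normative; check the spec for the exact shape), the example item would come out roughly as:

```json
{
  "items": [
    {
      "type": "http://n.whatwg.org/work",
      "properties": {
        "work": ["http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg"],
        "title": ["Emery Molyneux Terrestrial Globe"],
        "author": ["Bob Smith"],
        "license": ["http://creativecommons.org/licenses/by-sa/3.0/us/"]
      }
    }
  ]
}
```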
Philip Jägenstedt wrote:
I don't suppose that the members of this list appreciate the epic Microdata vs. RDFa battle leaking into this mailing list
I wouldn't use such terms to frame the debate. The Microformats, Microdata and RDFa communities are not "battling" or working against each other - they're having a very necessary, spirited debate. Clearly, both communities are influencing the design of the other and clearly we need to have these discussions in order to make sure that we're creating the best possible technology for the future of the Web.
More importantly, the reason that all of us are working on this technology is because we care about how it is used to better humanity. At least, I hope that's why people are working on this stuff :). Certainly, we all hold Wikipedia in high regard and want what's best for this community as well.
It's not /unfortunate/ that we're having the discussion here - it was inevitable.
I'm delighted by the fact that we're even having this debate. It took ages to convince the WHATWG that this was a problem that needed to be addressed[1] just 18 months ago.
So, we can either grit our teeth and begrudgingly go through the motions, or we can welcome the debate to come.
I choose to do the latter because I know that all of us will learn something from it and better understand the requirements for Wikimedia implementations. What we learn here will further influence guidance given to future communities, just as integrating RDFa with Drupal has influenced the advice that we may give to this community.
[ed: Microdata] is really quite intuitive and simple, with few surprises.
I agree on the first point - Microdata is pretty intuitive and simple, with few surprises. Although, I'd say the same for RDFa as well. I think we tend to forget, though, that Web semantics require a bit of effort to learn and the audience that is using the technology should be taken into account when deciding how to expose an authoring environment for the community.
I don't think that the best approach for Wikipedia is to allow direct Microdata or RDFa markup. There are already many templates in use at Wikipedia via Infobox - those templates could be leveraged to automatically generate RDFa in the same way that dbpedia.org uses those templates to generate RDF. The risk this community runs by allowing arbitrary semantic data markup is that contributors make mistakes causing half of the semantic data to be corrupted - making the rest of the data useless.
Neither Microdata nor RDFa come with few surprises for the beginner. Like all new web technologies, there is a learning curve for both of them and it's pretty similar since Microdata's design was influenced by RDFa and Microformats. More about the surprises with each, below.
[ed: Microdata] maps well to the RDF model if you want it, but doesn't force authors to think in terms of subject, predicate, object triples.
Well, Microdata /almost/ maps to the RDF model. Microdata doesn't support RDF literal typing, which is basically a fancy way of saying that you can't verify that weights, volumes, speeds, the full range of dates in different calendars, encodings such as chemical compositions, and other typed information are expressed cleanly by Wikipedia contributors.
So, if you wanted to say something like this:
The speed of light is 299792458 m/s.
You would do this in RDFa:
<div about="#light">
  The speed of light is
  <span property="measure:speed"
        datatype="measure:meters-per-second">299792458</span> m/s.
</div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
Some of you may be asking yourselves "Why is that so important?". The primary concern has to do with data validation. Good RDF vocabularies are built to be able to validate their data and this is important for large sites like Wikipedia to ensure that the data that they're exposing is valid. Since measure:speed's range is measure:meters-per-second, and meters-per-second is presumably a sub-class of xsd:decimal, then a data validator would know that it's expecting some sort of number. So, if a Wikipedia author enters some markup that generates this data:
<#baseball> measure:speed "fast enough to hurt" .
An RDF reasoner would know that not only is the data not typed, but even if it were typed, the value "fast enough to hurt" is not valid. I would expect that this most basic level of data validation would be important to Wikipedia as you want to make sure that contributors are being careful with their markup.
The above is how you would do it in RDFa. Philip, I haven't seen any work related to this in Microdata - have there been any recent developments with regard to data validation in Microdata?
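On the RDF side, a validator could flag untyped or mistyped values with a simple SPARQL query. A sketch (the measure: prefix URL is hypothetical, standing in for whichever measurement vocabulary is chosen):

```sparql
# Find speed values that are not typed as meters-per-second
PREFIX measure: <http://example.org/measure#>
SELECT ?s ?v
WHERE {
  ?s measure:speed ?v .
  FILTER ( datatype(?v) != measure:meters-per-second )
}
```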
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
No, as long as an item type is used (http://n.whatwg.org/work) there is no ambiguity. This particular item type is defined at http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#l...
The "title" property here "Gives the name of the work." There is no ambiguity.
This is new! I'm glad this issue was addressed in Microdata as it was one of my criticisms of it when I last read the Microdata spec about six months ago. Looks like that section of the spec was last changed on October 23rd 2009? Do you know when this was put in there, Philip?
What happens when an author forgets to include itemtype? So, if somebody does this:
<div itemscope>
  <span itemprop="title">Emery Molyneux Terrestrial Globe</span>
</div>
There's nothing to ground the "title" property. The way I'm reading the spec, it becomes ambiguous at that point, right?
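For contrast, the ambiguity goes away as soon as the item is typed; a corrected sketch of the same fragment:

```html
<div itemscope itemtype="http://n.whatwg.org/work">
  <span itemprop="title">Emery Molyneux Terrestrial Globe</span>
</div>
```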
RDFa is very careful to never let something like this happen... as this data ambiguity results in questionable data that you wouldn't want to pass to a reasoning agent.
Furthermore, for this particular vocabulary the mapping to RDF is defined, as follows:
title:   http://purl.org/dc/elements/1.1/title
author:  http://creativecommons.org/ns#attributionName
license: http://www.w3.org/1999/xhtml/vocab#license
In other words, you express the exact same information as with RDFa, but without the mental overhead of triples or mixing multiple vocabularies.
... and with the added danger of expressing ambiguous data. This is not the real danger, though. While data ambiguity is really bad when it comes to data stores, centralized vocabulary management is even worse.
RDFa is built on a concept called "follow your nose", which means that all vocabulary term URLs in RDFa, such as http://purl.org/media/audio#Recording, should be dereferenceable, and at the end of that URL should be a machine-readable description of the vocabulary term. Preferably, a human-readable description should also exist at that URL.
Dereference http://n.whatwg.org/work and you get a 404 error. Even worse, the Microdata work vocabulary is hard-coded in the HTML5 specification. If you wanted to extend the vocabulary, you would have to convince the sole editor of that specification, who has a track record of being both very easy and very difficult to work with (based on whether or not he agrees with you), that your vocabulary term warrants addition.
There are currently 3 Microdata vocabularies in the spec[2].
To contrast, there are over 250 active RDF vocabularies[3].
That is the true power of decentralized vocabulary development, which is a corner-stone of RDFa. The RDFa community understands that Wikipedia should be in charge of choosing and extending vocabularies since this community has the appropriate domain experts. You are the experts, we are not - and it's important to recognize that in the design of any semantic data expression language.
If Wikipedia agrees that embedding semantics in their pages is of worth to humanity (and I certainly think it is of great worth), then there will come a time that this community will want to develop their own vocabulary. RDFa allows that vocabulary to be developed independently of any standards body and allows this community to have full control of it.
Sure, you could make the argument that Microdata allows RDF to be expressed (as long as you use the complete vocabulary URL), but at that point the Microdata markup is far more cumbersome than the RDFa markup. And if the goal is to express RDF, that is exactly what RDFa was designed to accomplish.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
Certainly, but if wiki editors are *able* to do it by hand, then IMHO microdata is much less error-prone.
IMHO, there are ways to shoot yourself in the foot with both Microdata and RDFa - as I've outlined above. I suppose that you could use both and pick which foot you're going to shoot with which technology :), but my suggestion is that nobody should be making such generalized statements - that one is more error-prone than the other.
It's like saying that programming in Python is more error prone than programming in PHP - it depends entirely on the skill of the developer, what you're doing, and many other factors that are out of the hands of language designers.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe
Is Wikipedia using XHTML served as application/xhtml+xml? It seems that RDFa in "XHTML" as deployed only works because consumers pretend that the data is XHTML even though it is served as text/html and treated as such by browsers. I would assume that most pages using RDFa today are neither valid XHTML nor served with the XHTML MIME type. Any attempts to use browser DOM APIs to access the data will have surprising/confusing results, as HTML doesn't have namespaces but RDFa uses namespace syntax.
Frankly, this is something that nobody that uses this technology cares about because all they are ever going to see are key-value pairs (Microdata) or triples (RDFa).
This is something that only concerns browser manufacturers and RDFa parser writers. That's why there is a Microdata API, and why there is going to be an RDFa API. There also exist many RDFa parser implementations that abstract this low-level stuff away.
Both Microdata and RDFa are being designed to operate in "dirty" environments with invalid markup and will work regardless of the MIME type, file extension, markup botching and namespace support across websites and web browsers.
There are a number of RDFa Javascript implementations that work just[4] fine[5] on badly authored/served XHTML documents.
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_Nati... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shune...
The migration to XHTML+RDFa would only require the DOCTYPE to change... which shouldn't be any more difficult than transitioning to HTML5 (or HTML5+RDFa) in the future.
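Concretely, that change would amount to swapping the DOCTYPE, along the lines of the following sketch (the first DOCTYPE assumes MediaWiki's current XHTML 1.0 Transitional output):

```html
<!-- Before: XHTML 1.0 Transitional -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!-- After: XHTML+RDFa 1.0 -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkupValidation/DTD/xhtml-rdfa-1.dtd">
```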
Finally, I will note that it is very likely that the microdata DOM APIs will get implemented in browsers, making the semantic data available to scrapers, native browser interfaces and browser extensions such as user JavaScript. As an example, you might see an icon in the address bar for saving events to a calendar, or the license information of an image displayed in the native properties dialog. I stress again that I don't make any promises on behalf of Opera or any other browser vendor; these are just my predictions.
Again, this is exciting news and while I don't think Microdata is the proper solution for the Web, for the same reasons that are outlined above and many more, I'm delighted to hear that Opera is taking in-browser semantic data expression very seriously. How far we have come in just 18 months! :)
-- manu
[1]http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-August/015971.html [2]http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#m... [3]http://prefix.cc/popular/all [4]http://code.google.com/p/rdfquery/ [5]http://code.google.com/p/ubiquity-rdfa/
Philip wrote:
Certainly, but if wiki editors are *able* to do it by hand, then IMHO microdata is much less error-prone.
Manu Sporny wrote:
I don't think that the best approach for Wikipedia is to allow direct Microdata or RDFa markup. There are already many templates in use at Wikipedia via Infobox - those templates could be leveraged to automatically generate RDFa in the same way that dbpedia.org uses those templates to generate RDF. The risk this community runs by allowing arbitrary semantic data markup is that contributors make mistakes causing half of the semantic data to be corrupted - making the rest of the data useless.
Both of you seem to think that wikipedia editors would start placing RDF/Microdata interleaved with wiki markup. I don't think that could ever happen. The "direct markup" would be inserted into infoboxes (which are themselves wikitext, although they can get quite complex).
Perhaps we shouldn't provide the full power of RDF or Microdata yet, and provide instead an extension able to handle a subset, using one or the other.
(long text about if wikipedia XHTML is served as application/xml+xhtml and why it doesn't matter)
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_Nati... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shune...
The migration to XHTML+RDFa would only require the DOCTYPE to change... which shouldn't be any more difficult than transitioning to HTML5 (or HTML5+RDFa) in the future.
MediaWiki is expected to produce good XHTML (the output is passed through Tidy), but it nonetheless sometimes fails. And there are IE users, too. There is also a switch in MediaWiki for using HTML5 instead of XHTML.
2010/1/16 Platonides Platonides@gmail.com:
Both of you seem to think that wikipedia editors would start placing RDF/Microdata interleaved with wiki markup. I don't think that could ever happen. The "direct markup" would be inserted into infoboxes (which are themselves wikitext, although they can get quite complex).
Something deep inside the plumbing of a template would be the place for this.
- d.
Platonides wrote:
Both of you seem to think that wikipedia editors would start placing RDF/Microdata interleaved with wiki markup. I don't think that could ever happen. The "direct markup" would be inserted into infoboxes (which are themselves wikitext, although they can get quite complex).
Just to be clear - I'm not trying to propose that wikipedia editors should start writing wiki markup interleaved with RDFa/Microdata. Quite the opposite - I think that allowing contributors to hand author RDFa or Microdata would be a very bad idea for Wikipedia. However, it seems like what you are saying is that interleaving HTML like this is not possible anyway - which is a good thing, IMHO.
Perhaps we shouldn't provide the full power of RDF or Microdata yet, and provide instead an extension able to handle a subset, using one or the other.
XHTML1+RDFa is certainly ready for prime-time, so it would be up to this community to decide if it should go that route and put it into the core distribution or have it implemented as an extension.
I think our preference would be that it is implemented as an extension first and in such a way as to make it very easy to integrate it into MediaWiki core once all of the bugs are worked out in the extension.
Does anybody have a link to a previous discussion about how to get Wikipedia to output the same data that dbpedia.org is publishing?
David Gerard wrote:
Something deep inside the plumbing of a template would be the place for this.
I agree.
-- manu
I could see the flames rising at the start of this thread, so thank you both for steering away from them.
Essentially we have a format war here, in which one or other format will win and the other will go extinct. It might be being fueled by altruism rather than capitalism, and that's brilliant, but VHS and Betamax are watching from the wings. I know sod all about either of them except what has been posted here, but I can see that they're incredibly similar, yet just different enough to be incompatible; and I can see that they are both horribly difficult for the lay editor to use. By that I mean that the discussion of "oh, this one only requires us to put in two new attributes instead of three" misses the elephant in the room: *both* formats require us to whitelist, and start filling our wikitext with, the HTML tags that the most iconic piece of wiki markup, the double brackets, has kept hidden for nine years. The reason we brought in that now-ubiquitous syntax hasn't changed: the damn thing was too difficult for the layman to understand and use.
We do, without a doubt, need to implement this metadata-capture in MediaWiki somehow, but we need to do it not only in a way that the majority of people can use and understand, but in a way which doesn't make wikitext even more complicated for everyone. If either syntax were enabled, yes, it would end up at the bottom of a template stack, but a) that's not going to do anything to ensure that the tags aren't being misused elsewhere, and b) even the most careful implementation is going to manifest itself in article wikitext along the lines of "{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a {{occupation|football player}} for {{organisation|Puddlemere United}}". Or something like that. If we encourage editors to go the whole hog on this, we might as well install SMW.
There seem to be two usecases for these systems. First, marking up the 'stuff' that MediaWiki serves: images, copyright links, author links, etc. That requires MW to be able to get hold of the raw data for, for instance, an image license; and that's begging for things like new magic words to put on the image description page, not for enabling either format directly in wikitext. The only reason to do *that*, is to support editors marking up *their own stuff*, and that's where we have problems.
I think that it would be foolish beyond belief to encourage editors to divert their volunteer time to implementing a system that could turn out to be totally anachronistic within two years; and while I think it's a laudable long-term goal I think it would thus be very silly to let editors insert *either* format directly into wikitext at this point, or for a good year to come. By far the top priority should be implementing structures by which MediaWiki can *collect* semantic data. If we implement a {{COPYRIGHT:...}} parser function, or a metadata form, or (as I've been musing over for a while) a Category-esque system that wasn't based on wikitext and so which could have a fine-grained permissions interface; we create a feature that is useful whatever happens in the metadata world. We can implement RDFa with that data, microdata, both, neither or something else entirely. We could certainly expose it through our own API. Whatever happens, editor work is not wasted.
TLDR version: jumping on either bandwagon is neither necessary nor sensible, and we should avoid getting drawn into the issue. Implementing either of the proposed methods in raw wikitext actively defeats one of the purposes of MediaWiki: to make it as easy as possible for anyone to edit stuff. It would need to be carefully thought through, and there's no point putting that effort in until we know which format has come out on top. Adding metadata to MW's own stuff is much easier, but its groundwork should be format-independent.
in this world of economic crisis, £0.02 seems to go quite a long way :-D
--HM
Trying my best to limit length of reply.
On Sat, Jan 16, 2010 at 23:16, Manu Sporny msporny@digitalbazaar.com wrote:
Philip Jägenstedt wrote:
[ed: Microdata] maps well to the RDF model if you want it, but doesn't force authors to think in terms of subject, predicate, object triples.
Well, Microdata /almost/ maps to the RDF model. Microdata doesn't support RDF literal typing, which is basically a fancy way of saying that you can't verify that weights, volumes, speeds, the full range of dates in different calendars, encodings such as chemical compositions, and various other typed information are expressed cleanly by Wikipedia contributors.
So, if you wanted to say something like this:
The speed of light is 299792458 m/s.
You would do this in RDFa:
<div about="#light"> The speed of light is <span property="measure:speed" datatype="measure:meters-per-second">299792458</span> m/s. </div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
The datatype is part of the vocabulary: if you want to validate your data, you validate it against the vocabulary, not against what the author claims. For example, you'll see that the vCard vocabulary defines its own datatypes: http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#v...
Allowing mixed types (like m/s and km/h) seems risky, but it is correct that this is one of the things that exist in the RDF model and can't be expressed directly using microdata.
The above is how you would do it in RDFa. Philip, I haven't seen any work related to this in Microdata - have there been any recent developments with regard to data validation in Microdata?
There is nothing like automatic validation; your software has to understand a certain vocabulary to be able to say whether the data conforms to the constraints of that particular vocabulary. (I don't know if this is any different from the RDF model, or if RDF software is able to "automatically" learn how to validate measure:meters-per-second from just seeing the string "measure:meters-per-second".)
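To make Philip's point concrete, here is a minimal Python sketch of vocabulary-driven validation: the constraints live in the consumer's knowledge of the vocabulary, not in the markup. The "measure:" vocabulary and its checkers below are invented purely for illustration.

```python
# Hypothetical sketch: validation is driven by vocabulary knowledge, not by
# markup-level datatype annotations. The consumer must already understand
# the (invented) vocabulary to check values at all.

VOCABULARY = {
    # property name -> validator returning True if the value conforms
    "measure:speed": lambda v: v.replace(".", "", 1).isdigit(),
    "license": lambda v: v.startswith("http://") or v.startswith("https://"),
}

def validate(prop, value):
    """Return True/False if the vocabulary defines constraints for prop,
    or None if the property is unknown (nothing can be checked)."""
    checker = VOCABULARY.get(prop)
    if checker is None:
        return None  # unknown property: no automatic validation possible
    return checker(value)
```

An unknown property yields None rather than False: the consumer simply has nothing to say about data it doesn't understand, which matches Philip's description for both microdata and plain RDF consumers.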
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
No, as long as an item type is used (http://n.whatwg.org/work) there is no ambiguity. This particular item type is defined at http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#l...
Title here "Gives the name of the work." without ambiguity.
This is new! I'm glad this issue was addressed in Microdata as it was one of my criticisms of it when I last read the Microdata spec about six months ago. Looks like that section of the spec was last changed on October 23rd 2009? Do you know when this was put in there, Philip?
Originally microdata used item="http://n.whatwg.org/work", but even then there was no ambiguity about what a particular property meant.
What happens when an author forgets to include itemtype? So, if somebody does this:
<div itemscope> <span itemprop="title">Emery Molyneux Terrestrial Globe</span> </div>
There's nothing to ground the "title" property. The way I'm reading the spec, it becomes ambiguous at that point, right?
Like Aryeh said it's not ambiguous, it's meaningless. Microdata allows typeless items for site-private use (much like data-*), but such data *should not* be used by external parties and is in fact ignored by the RDF extraction algorithm.
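As a rough illustration of that extraction rule, with items modelled as plain dictionaries (a deliberate simplification of the real microdata model):

```python
# Simplified model: a typeless item is site-private, so an RDF-style
# extractor skips it instead of emitting ungrounded (meaningless) data.

def extract_for_rdf(items):
    """Keep only items whose properties can be grounded via an item type."""
    return [item for item in items if item.get("itemtype")]

items = [
    {"itemtype": "http://n.whatwg.org/work",
     "props": {"title": "Emery Molyneux Terrestrial Globe"}},
    # No itemtype: usable only site-internally, ignored by extraction.
    {"props": {"title": "site-private note"}},
]
```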
... and with the added danger of expressing ambiguous data. This is not the real danger, though. While data ambiguity is really bad when it comes to data stores, centralized vocabulary management is even worse.
Anyone can make up a vocabulary, just point to it in itemtype. The WHATWG maintains a few core vocabularies, but I expect that new vocabularies will be developed independently by communities like microformats.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
No process, just do it :)
Finally I will note that it is very likely that the microdata DOM APIs will get implemented in browsers, making the semantic data available to scrapers, to native browser interfaces, and to browser extensions such as user JavaScript. As an example, you might see an icon in the address bar for saving events to a calendar, or the license information of an image displayed in the native properties dialog. I stress again that I don't make any promises on behalf of Opera or any other browser vendor; these are just my predictions.
Again, this is exciting news and while I don't think Microdata is the proper solution for the Web, for the same reasons that are outlined above and many more, I'm delighted to hear that Opera is taking in-browser semantic data expression very seriously. How far we have come in just 18 months! :)
I will stress again that I don't speak for Opera in these matters, but I do think that microdata in many ways bridges the gap between the "browsable web" and the "semantic web" (actually, there is only one web). Browsers already do add some UI features based on the data in documents (apart from rendering), e.g. exposing RSS feeds in the address bar or navigating to the next page based on rel="next". Microdata isn't really new in that regard, it just adds some new data for browsers to expose.
2010/1/16 Manu Sporny msporny@digitalbazaar.com:
I don't know if you intended the tone of your e-mail in the way that I read it, but it came off as purposefully misleading based on the discussions that both you and I have had as members of the HTMLWG and WHATWG.
[...]
We have a very friendly community
- d.
On Sat, Jan 16, 2010 at 12:32 AM, Manu Sporny msporny@digitalbazaar.com wrote:
I don't know if you intended the tone of your e-mail in the way that I read it, but it came off as purposefully misleading based on the discussions that both you and I have had as members of the HTMLWG and WHATWG.
I do not claim to be an expert on RDFa, Microdata, or any similar technology. I'd prefer not to have to make a decision here at all, and I've said so. However, it looks like we (MediaWiki) have good reason to use something or other. For the reasons I gave, I think we should choose whatever we believe is more likely to succeed, and failing that, whatever we think is better (e.g., on grounds of aesthetics or intuitiveness). The example markup I gave might not be ideal or accurate, but it serves to give a general idea of what the markup looks like in each case, at least. Thank you for your better RDFa examples -- although it's telling that I was able to get Microdata right on the first try, but apparently it took an RDFa expert to figure out the correct RDFa.
However, as a Wikimedian, I'd like to point you to one of our core guiding principles:
http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith
One lesson that we learned during implementation of RDFa in Drupal is that it is helpful for CMS designers to pre-define vocabularies that are usable with their CMS systems if manual markup is necessary. Most markup of both Microdata and RDFa should also be left to the CMS code unless there is a very good reason to not do so.
The major use case for us is image licensing on Commons. Currently the license templates are generated "by hand", in the sense that they're not hardcoded in the software; in practice they're maintained by technically advanced community members, so ordinary users don't see the markup. To use my example image, look at this page:
http://commons.wikimedia.org/wiki/File:EmeryMolyneux-terrestrialglobe-1592-2...
You can see the wikitext source of the page by hitting "view source" (or "edit" if it's unprotected by the time you read this) at the top. The license info is generated by:
{{cc-by-2.0}}
This expands to:
<table cellspacing="8" cellpadding="0" style="width:100%; clear:both; text-align:center; margin:0.5em auto; background-color:#f9f9f9; border:2px solid #e0e0e0; direction: ltr;" class="layouttemplate">
  <tr>
    <td style="width:90px;" rowspan="3">
      <img alt="w:en:Creative Commons" src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/CC_some_rights_reserved.svg/90px-CC_some_rights_reserved.svg.png" width="90" height="36" /><br />
      <img alt="attribution" src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Cc-by_new_white.svg/24px-Cc-by_new_white.svg.png" width="24" height="24" />
    </td>
    <td>This file is licensed under the <a href="http://en.wikipedia.org/wiki/en:Creative_Commons" class="extiw" title="w:en:Creative Commons">Creative Commons</a> <a href="http://creativecommons.org/licenses/by/2.0/deed.en" class="external text" rel="nofollow">Attribution 2.0 Generic</a> license.</td>
    <td style="width:90px;" rowspan="3"></td>
  </tr>
  <tr style="text-align:center;">
    <td></td>
  </tr>
  <tr style="text-align:left;">
    <td>
      <dl>
        <dd>You are free:
          <ul>
            <li><b>to share</b> – to copy, distribute and transmit the work</li>
            <li><b>to remix</b> – to adapt the work</li>
          </ul>
        </dd>
        <dd>Under the following conditions:
          <ul>
            <li><b>attribution</b> – You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).</li>
          </ul>
        </dd>
      </dl>
    </td>
  </tr>
</table>
(Not cutting-edge markup, but oh well.) This is generated by the contents of http://commons.wikimedia.org/wiki/Template:Cc-by-2.0, which was created by the Commons community. Pretty much all boilerplate on Wikimedia projects is created by such templates. So when we enable Microdata and/or RDFa in MediaWiki wikitext, I'd expect it to be used almost exclusively in templates, with few users actually being directly exposed to it. Since the content of MediaWiki pages has no structure other than wikitext, basically we have to allow this in wikitext to make it useful to mark up content.
I'll emphasize from the start that I do *not* think either RDFa or microdata is suitable for dbpedia.org-style content. There's no reason we should put that in the HTML output, where it will take up tons of space and not be useful to HTML consumers (e.g., browsers and search engines). That sort of data should be made available in a separate stream for consumers who want it, in a dedicated format like RDF. That way HTML consumers aren't forced to download loads of useless metadata, and metadata consumers aren't forced to download loads of useless (and expensive-to-generate) HTML. RDFa/Microdata should *only* be used for metadata that's useful to HTML consumers of some kind.
If you want to allow manual markup of RDFa, MediaWiki should probably pre-define at least Dublin Core (used to describe creative works), FOAF (used to describe people and organizations), and Creative Commons (used to describe licenses).
I expect that we'd allow contributors to use whatever vocabularies they'd like. It's a wiki, after all. :)
The above could be marked up in RDFa, with pre-defined vocabs, like so:
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage"> <span property="dc:title">Emery Molyneux Terrestrial Globe</span> by <a rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
. . .
So, four pieces of data, which is pretty good considering the compactness of the HTML code. The Microdata looks like this:
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work"> ... <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div>
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example.
You're comparing apples to oranges here: you included the div and img for Microdata but not RDFa. If you include that for RDFa, and also count the xmlns:, it becomes (correct me if I'm wrong)
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage"><span xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">Emery Molyneux Terrestrial Globe</span> by <a xmlns:cc="http://creativecommons.org/ns#" rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
You do have to count the xmlns: somewhere. Even if you put them on the <html>, they still count at least once, and in this case they're only used once on the page, so they deserve to count in full. This is 685 characters. On the other hand, Microdata:
[[ <div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work"> ... <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div> ]]
525 characters. Compare to the original with no extra semantics:
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p>Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a <a href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div> ]]
380 characters. So Microdata adds 145 bytes, while RDFa adds 305: 2.1 times as much extra markup. To be fair, you included an extra link to http://example.org/bob/ which wasn't in the original example, but RDFa is still about twice as many bytes.
It's not just bytes, though. It's also complexity. The Microdata is *obvious*. I've never used Microdata before in my life, or RDFa, but somehow I got the Microdata right on the first try, while making several errors in the RDFa. It's not at all obvious what those xmlns: things do, or what those cryptic prefixes mean. Microdata is simpler to understand at first glance for people from an HTML background. Since you've been working with RDF for years, the magnitude of the difference is probably not apparent to you.
Getting Microdata and RDFa markup correct is easier if there are templates or if the semantic markup is performed automatically by the CMS based on a pre-defined form. For example, see http://en.wikipedia.org/wiki/Augustus and note the infobox on the right. It would be much better for the RDFa markup to happen automatically via MediaWiki's template process than for it to be marked up by hand.
As I noted, the templates are made by hand, by each community. The software just gives the ability to include one page in another with simple substitutions made. The infobox on the Augustus article is http://en.wikipedia.org/wiki/Template:Infobox_royalty, invoked like so:
{{Infobox royalty
| name = Caesar Augustus
| title = [[Roman Emperor|Emperor]] of the [[Roman Empire]]
. . . snip 18 lines . . .
| place of death = [[Nola]], [[Italia (Roman Empire)|Italia]], [[Roman Empire]]
| place of burial = [[Mausoleum of Augustus]], Rome
|}}
The template authors would be the ones to add semantics here, not the software developers. There are a couple orders of magnitude more wiki editors than software developers, so it just wouldn't be practical for the developers to be the ones to assign semantic markup to each and every template. Moreover, as you can tell from the HTML output of the templates, template editors tend to be of the "copy-paste stuff until it works" school of HTML authorship. So you cannot argue here that RDFa is just as good if we abstract away the actual markup. We aren't in a position to do that -- users with little to no knowledge of RDFa or microdata will be editing the raw markup, and that has to be taken into account.
Intentional or not, Aryeh has painted RDFa in a negative light by not outlining a number of points related to adoption and to the current status of both RDFa and Microdata in the HTML Working Group. Adopting either RDFa or Microdata in an HTML5 document type would be premature at this time because neither has progressed past the Editor's Draft stage yet. Either is subject to change as far as HTML5 is concerned, and we really don't want you to ship HTML5 features before they've had a chance to solidify a bit more.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use it for deployment.
Microdata is also safe to use for deployment. Like other web technologies maintained by the WHATWG, it will not change once it's widely adopted, and Wikipedia adoption would probably count as wide adoption by itself. Note that microdata, like all of HTML5, is at Last Call at the WHATWG, independent of its status as Working Draft in the W3C.
I've asked Hixie how stable Microdata is. Since he's the sole person who decides on changes to HTML5 at the WHATWG, as you know, his answer should be authoritative.
Google[1] is actively indexing RDFa today, as is Yahoo[2]. Digg, Whitehouse.gov, the UK Government, The Public Library of Science, and O'Reilly are high-profile sites that publish their pages using RDFa. Data formats such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of their language. Best Buy saw a 30% traffic increase after publishing their pages in RDFa using the GoodRelations vocabulary. I'm sure everyone here is aware of dbpedia.org[3] and Freebase[4], which use RDF as a semantic representation format. dbpedia, which gets its data from Wikipedia, shows 479 million triples available - so that should give you folks some idea of the treasure trove of immediately extractable semantic data we're talking about.
Make no mistake - RDFa has very strong deployment at this point, and it will continue to grow past 100,000 sites with the upcoming release of Drupal 7.
Right -- because microdata is so new. How many of those groups actually considered using microdata? I'd guess roughly none, because in most cases, microdata either didn't exist or was barely known. If microdata is much more intuitive and simpler to use, I'd expect it to win in the long run, say five years from now. RDFa isn't so widely used that it can't be easily defeated by a clearly superior technology.
On Sat, Jan 16, 2010 at 6:37 AM, Philip Jägenstedt philip@foolip.org wrote:
Is Wikipedia using XHTML served as application/xml+xhtml?
No. We're currently using XHTML1.0 served as text/html. I expect us to switch to HTML5 served as text/html (which happens to also be well-formed XML) before we deploy support for either microdata or RDFa.
On Sat, Jan 16, 2010 at 5:16 PM, Manu Sporny msporny@digitalbazaar.com wrote:
You would do this in RDFa:
<div about="#light"> The speed of light is <span property="measure:speed" datatype="measure:meters-per-second">299792458</span> m/s. </div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
You could define different properties for different units, or allow the data to include unit info directly. Like
<span itemprop="speed">299792458 m/s</span>
and have the format itself define what "m/s" means. I don't see this as a practical issue in MediaWiki, given our use-cases (in particular, emphatically excluding markup of data that's useless to typical HTML consumers).
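To sketch what "the format itself defines what 'm/s' means" could look like on the consumer side (the unit table here is invented, not part of any real vocabulary):

```python
# Hypothetical consumer of a speed property whose unit is embedded in the
# value, as in <span itemprop="speed">299792458 m/s</span>. A real
# vocabulary would enumerate the allowed unit strings; these are made up.

UNIT_FACTORS = {"m/s": 1.0, "km/h": 1 / 3.6}  # normalise to metres per second

def parse_speed(text):
    """Parse '299792458 m/s' into a float in m/s; raise on unknown units."""
    value, _, unit = text.rpartition(" ")
    if unit not in UNIT_FACTORS:
        raise ValueError("unknown unit: %r" % unit)
    return float(value) * UNIT_FACTORS[unit]
```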
An RDF reasoner would know that not only is the data not typed, but even if it were typed, the value "fast enough to hurt" is not valid.
A microdata standard would also define what type of data is valid. For instance, from the license vocabulary: "The value must be an absolute URL." "The value must be either an item with the type http://microformats.org/profile/hcard, or text."
What happens when an author forgets to include itemtype?
The same as if an author forgets to include xmlns:. It's not tied to any vocabulary, you have to either guess or ignore it. It's not ambiguous, it's just meaningless. There's no difference to RDFa here, except that RDFa encourages you to link to the profile IDs on the <html> element, which is much more likely to break under copy-paste.
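For readers unfamiliar with how those xmlns: declarations get consumed, here is a simplified sketch of CURIE expansion (not a real RDFa processor): if the prefix isn't in scope, the property can't be grounded, much like a typeless microdata item.

```python
# Simplified CURIE expansion: 'dc:title' is only meaningful if the 'dc'
# prefix was bound by an in-scope xmlns: declaration. Copy-pasting markup
# without that declaration leaves the property ungrounded.

def resolve_curie(curie, prefixes):
    """Expand 'dc:title' to a full URL, or return None if the prefix
    is unbound (the data is then unusable to a consumer)."""
    prefix, _, local = curie.partition(":")
    base = prefixes.get(prefix)
    if base is None:
        return None
    return base + local

prefixes = {"dc": "http://purl.org/dc/elements/1.1/"}
```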
RDFa is built on a concept called "follow your nose", which means that all vocabulary term URLs in RDFa, such as http://purl.org/media/audio#Recording, should be dereference-able and at the end of that URL should be a machine-readable description of the vocabulary term. Preferably, a human-readable description should also exist at that URL.
The perils of using URLs like this are well-known. Just ask the W3C how many hits it gets for DTDs every second. Microdata deliberately and wisely avoids using URLs that machines are intended to dereference. On the other hand, humans can find the info easily:
http://www.google.com/search?q=http://n.whatwg.org/work
I imagine it's meant to resolve to a human-readable spec, though, for the same discoverability as RDFa. It's probably an oversight; I've asked Hixie to clarify.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
We don't have to. See the spec:
"The item type must be a type defined in an applicable specification." http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#i...
"Applicable specification" links to
"When vendor-neutral extensions to this specification are needed, either this specification can be updated accordingly, or an extension specification can be written that overrides the requirements in this specification. When someone applying this specification to their activities decides that they will recognise the requirements of such an extension specification, it becomes an applicable specification for the purposes of conformance requirements in this specification." http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.h...
Anyone can write their own extension specification -- it becomes "applicable" as soon as anyone decides to use it.
It's like saying that programming in Python is more error prone than programming in PHP - it depends entirely on the skill of the developer, what you're doing, and many other factors that are out of the hands of language designers.
I think you'll find most MediaWiki developers strongly agree that PHP is a terrible language and Python is way better, so maybe that was a bad analogy. :)
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
Well, rather, MediaWiki has done a good job there, despite all attempts by the community. ;) Community inputs tag soup, MediaWiki converts to valid XHTML. But that's purely syntactic. You can tell from the extensive usage of tables that Wikipedians don't care about standards or theoretical purity, they just try to get things to work right. That has to be taken into account.
On Sat, Jan 16, 2010 at 5:39 PM, Platonides Platonides@gmail.com wrote:
Perhaps we shouldn't provide the full power of RDF or Microdata yet, and instead provide an extension able to handle a subset, using one or the other.
What sort of user-visible syntax would you suggest? We'd still have to use either RDFa or microdata for the actual output, so it doesn't save us much.
On Sat, Jan 16, 2010 at 7:09 PM, Happy-melon happy-melon@live.com wrote:
I know sod all about either of them except what has been posted here, but I see that they're incredibly similar, yet just different enough to be incompatible; and I see that they are both horribly difficult for the lay editor to use. By that I mean that the discussion of "oh, this one only requires us to put in two new attributes instead of three" misses the elephant in the room: *both* formats require us to whitelist new attributes and start filling our wikitext with the raw HTML that the most iconic piece of wikimarkup, the double brackets, have kept hidden for nine years.
I don't think microdata is harder to use than HTML generally. It's sure a lot easier to use than wikitext template syntax (look at some of those enwiki monstrosities).
and b) even the most careful implementation is going to manifest itself in article wikitext along the lines of "{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a {{occupation|football player}} for {{organisation|Puddlemere United}}". Or something like that.
No, I don't think we'd do that at all. We'd add microdata (or RDFa) to things like license templates, and maybe infobox templates. So this would all be hidden behind templates people are already using anyway. The goal is immediately useful metadata like licenses -- we want web crawlers to be able to automatically tell what licenses images are under, say. Abstract stuff like you're marking up shouldn't be provided with the HTML output, and should be input as part of infoboxes (since people do that anyway).
There seem to be two use cases for these systems. First, marking up the 'stuff' that MediaWiki serves: images, copyright links, author links, etc. That requires MW to be able to get hold of the raw data for, for instance, an image license; and that's begging for things like new magic words to put on the image description page, not for enabling either format directly in wikitext. The only reason to do *that*, is to support editors marking up *their own stuff*, and that's where we have problems.
I don't follow. Why can't you just alter {{cc-by-2.0}} or whatever on Commons so it outputs the right markup? MediaWiki doesn't have to do anything beyond allowing the markup to begin with.
TLDR version: jumping on either bandwagon is neither necessary nor sensible, and we should avoid getting drawn into the issue.
I would agree, except that we have an immediate potential use: marking up image licenses so image crawlers know how the images are licensed. Google already hardcodes Wikipedia licenses, apparently, but we should use standards-based machine-readable markup for the benefit of all the other MediaWikis, and any Wikimedia wikis they haven't hardcoded, and Commons too if they change a template name or something and break the scraping, etc. This is why Duesentrieb added the feature. Unless we all agree it's not worth getting into this for the sake of that use-case, we do have to address the issue now.
On Sat, Jan 16, 2010 at 7:13 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Just to be clear - I'm not trying to propose that wikipedia editors should start writing wiki markup interleaved with RDFa/Microdata. Quite the opposite - I think that allowing contributors to hand author RDFa or Microdata would be a very bad idea for Wikipedia. However, it seems like what you are saying is that interleaving HTML like this is not possible anyway - which is a good thing, IMHO.
HTML can be interleaved with wikitext. This is needed because all templates are written in wikitext, for instance. Templates are just chunks of wikitext that can get included in other pages, optionally with some predefined parameters substituted with strings of yet more wikitext. So MediaWiki recursively substitutes all templates (along with other things like conditional constructs) with their wikitext output before evaluating the whole resulting mess as a single wikitext string.
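As a toy model of that expansion process (my own sketch; MediaWiki's actual parser handles vastly more: named parameters, parser functions, conditionals, and so on):

```python
import re

# Toy template expander: {{name|arg}} is replaced by the template body,
# with {{{1}}} substituted by the argument, recursively. The template
# store below is hypothetical.

TEMPLATES = {
    "cc-by-2.0": "Licensed under [[Creative Commons]] Attribution 2.0.",
    "bold": "'''{{{1}}}'''",
}

def expand(wikitext):
    def repl(match):
        name, _, arg = match.group(1).partition("|")
        if name in TEMPLATES:
            body = TEMPLATES[name].replace("{{{1}}}", arg)
            return expand(body)  # recurse: templates may include templates
        return match.group(0)  # unknown template: leave as-is
    return re.sub(r"\{\{([^{}]+)\}\}", repl, wikitext)
```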
Does anybody have a link to a previous discussion about how to get Wikipedia to output the same data that dbpedia.org is publishing?
As far as I can tell, dbpedia.org just has people manually sift through Wikipedia templates and translate them to RDF. Things like infoboxes naturally lend themselves to users inputting key-value pairs, which can easily be translated to RDF triples. I don't think we should use either microdata or RDFa for this kind of data-mining use-case -- it would be way too much markup and not useful to practically any viewers. People who want to data-mine can use a separate data stream, possibly RDF, possibly autogenerated by MediaWiki. Inline metadata is only ideal for things you want browsers, search engines, and other HTML consumers to see.
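For what it's worth, the key-value-to-triples mapping really is mechanical; a sketch, with a made-up "wp:" predicate namespace:

```python
# Illustrative only: infobox parameters are already (key, value) pairs,
# so producing subject/predicate/object triples is a direct mapping.
# The "wp:" namespace is invented for this sketch.

def infobox_to_triples(subject, params):
    return [(subject, "wp:" + key.replace(" ", "_"), value)
            for key, value in sorted(params.items())]

triples = infobox_to_triples("Augustus", {
    "name": "Caesar Augustus",
    "place of burial": "Mausoleum of Augustus",
})
```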
On Sat, Jan 16, 2010 at 8:25 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Microdata is also safe to use for deployment. Like other web technologies maintained by the WHATWG, it will not change once it's widely adopted, and Wikipedia adoption would probably count as wide adoption by itself. Note that microdata, like all of HTML5, is at Last Call at the WHATWG, independent of its status as Working Draft in the W3C.
I've asked Hixie how stable Microdata is. Since he's the sole person who decides on changes to HTML5 at the WHATWG, as you know, his answer should be authoritative.
[100116 20:35:42] <AryehGregor> I assume that if Wikipedia starts using it on a large scale and we do a MediaWiki release and such, though, you won't change it after that and break all our content, right? [100116 20:35:56] <Hixie> correct
So it's certainly stable enough for us to use.
Hello,
The discussion so far has been about biographical data on Wikipedia and licensing data on Commons, but other projects have their own needs for it.
Wikisource, especially, is in desperate need of metadata. We have some 140,000 pages on the English wiki alone that represent poems, chapters, tables of contents, and so forth. These are essentially disorganized: we have human-usable templates and categories, but there's really no good way to find works besides searching their titles.
A few years ago we combined our metadata templates into two standard templates, {{header}} (for works) and {{author}} (for authors). Every single page already provides metadata to these templates, so implementing a metadata format for machine use is trivial once it is available on MediaWiki. We *really* want this; it would allow us to index our jumbled pile of works and authors in all sorts of very useful and interesting ways. Just a few examples are author search and autocompletion (we currently list works manually), finding works by genre and year and subject and so forth, searching work descriptions, and distinguishing works from subpages.
Both formats have their own advantages and disadvantages. Microdata's simplicity is a significant advantage, but RDFa's built-in validation is also nice. Whichever format we choose, we'll make it all work behind the scenes in the murky depths of our templates. But it would be nice if you'd include creative works, authors, navigation, and indexes in the equation. There's more here than biographies and image licenses. :)
On Sat, Jan 16, 2010 at 9:07 PM, Jesse (Pathoschild) pathoschild@gmail.com wrote:
Wikisource, especially, is in desperate need of metadata. We have some 140,000 pages on the English wiki alone that represent poems, chapters, tables of contents, and so forth. These are essentially disorganized: we have human-usable templates and categories, but there's really no good way to find works besides searching their titles.
A few years ago we combined our metadata templates into two standard templates, {{header}} (for works) and {{author}} (for authors). Every single page already provides metadata to these templates, so implementing a metadata format for machine use is trivial once it is available on MediaWiki. We *really* want this; it would allow us to index our jumbled pile of works and authors in all sorts of very useful and interesting ways. Just a few examples are author search and autocompletion (we currently list works manually), finding works by genre and year and subject and so forth, searching work descriptions, and distinguishing works from subpages.
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use. The only use that any of this metadata stuff has to us is exposing info to *non*-Wikimedia agents. For internal use, we can make up our own custom formats and use plain old database queries much more easily than resorting to any standard format.
For instance, we have lots of images on Commons under various licenses. *We* know which license each is under, because we use MediaWiki's category system. But *other* people (e.g., search engines) also want to know what licenses our images are under. So for this we want a standard format like microdata or RDFa, so they don't have to keep track of our internal data formats.
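As a sketch of what exposing a license in a standard format might look like, here is a hypothetical generator for an RDFa-style snippet built on the standard rel="license" link relation. The URLs and wrapper markup are illustrative, not MediaWiki's actual output:

```python
def license_markup(img_url, license_url, license_name):
    """Render a minimal RDFa-style snippet asserting an image's
    license. A real template would likely also carry cc: or dc:
    properties alongside rel="license"."""
    return (f'<div about="{img_url}">'
            f'<a rel="license" href="{license_url}">{license_name}</a>'
            f'</div>')

snippet = license_markup(
    "http://commons.wikimedia.org/wiki/File:Example.jpg",
    "http://creativecommons.org/licenses/by-sa/3.0/",
    "CC BY-SA 3.0",
)
print(snippet)
```

A crawler that understands rel="license" can then attribute the license without knowing anything about MediaWiki's category system.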
What Wikisource needs here is a MediaWiki extension. Standard metadata languages are not going to help at all. If no one is willing to write an extension for it now, no one will be willing with RDF support -- since that won't make the job the slightest bit easier.
Both formats have their own advantages and disadvantages. Microdata's simplicity is a significant advantage, but RDFa's built-in validation is also nice.
Neither has more built-in validation than the other. Both allow arbitrary validation. RDFa seems to allow validation to be encoded in a more machine-readable format, but whether that's an advantage at all is debatable. HTML5 does not provide a DTD, XML Schema, or any other machine-readable language description, for good reason.
On Sat, Jan 16, 2010 at 9:37 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use. The only use that any of this metadata stuff has to us is exposing info to *non*-Wikimedia agents. For internal use, we can make up our own custom formats and use plain old database queries much more easily than resorting to any standard format. [...] For instance, we have lots of images on Commons under various licenses. *We* know which license each is under, because we use MediaWiki's category system.
Unfortunately, categories and database queries are inadequate for our needs. Someone can indeed navigate to Categories::Works::Works by genre::Non-fiction::Governmental::Biographies::Ancient biographies, and they'll find all 5 pages that someone thought to categorize to this depth. But if someone hopes to find our 1872 American biographies, they are going to be sorely disappointed.
Metadata, whether a standard or internal format, allows machines to extract this data from template output and store it in a database for human use. If you want 1872 American biographies mentioning a Willard, just fill in the year, location, and description fields, and check off the relevant genres from the database. This will return a list of actual works that match the exact criteria given, not subpages or mid-text false matches which are the best we can get now.
If we simply extend MediaWiki to support metadata for works or authors, the metadata is limited to these types and fields. Public metadata can be extended and parsed in any way the local community or our content users feel useful. Users can add their own metadata (translators? publishers? work licenses?) to templates, and add their own tools and databases to the collection.
This is also not possible with database queries, since the metadata is not provided to the software except as part of the wiki text. It's conceivable to extract it directly from the wiki text of a wiki dump, but this would be horrendously complex given the number of different options and combinations. It's possible to use an internal Wikimedia format, but this would be useless outside Wikimedia.
There is very little difference between internal and external use; it's no easier for a Wikisource editor to find those 1872 American biographies. Editors are also users. Categories are inadequate beyond the simplest one-dimensional criteria.
So, these metadata formats are definitely *not* useless for internal community use.
On Sat, Jan 16, 2010 at 10:07 PM, Jesse (Pathoschild) pathoschild@gmail.com wrote:
Unfortunately, categories and database queries are inadequate for our needs. Someone can indeed navigate to Categories::Works::Works by genre::Non-fiction::Governmental::Biographies::Ancient biographies, and they'll find all 5 pages that someone thought to categorize to this depth. But if someone hopes to find our 1872 American biographies, they are going to be sorely disappointed.
You can do this with database queries fine -- there are already several different toolserver tools that will do category intersections for you, and a couple extensions. In fact, bog-standard search will do it for you, although AFAIK only for categories added literally (not by templates):
http://en.wikipedia.org/w/index.php?title=Special:Search&redirs=1&se...
It wouldn't be that hard to allow template-added categories too. I assume you have categories like "books published in America", "books published in 1872", and "biographies" -- if not, you can easily add them via your templates (although that wouldn't work right now with standard search AFAIK, it would work with things like CatScan).
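The category-intersection approach described above amounts to set intersection over category membership. A toy sketch, with page and category names invented:

```python
# Pages in each category, keyed by category name (toy data).
categories = {
    "Books published in America": {"Life of Grant", "Moby-Dick", "Walden"},
    "Books published in 1872": {"Life of Grant", "Roughing It"},
    "Biographies": {"Life of Grant", "Eminent Victorians"},
}

def intersect(*names):
    """Return the pages belonging to every named category."""
    sets = [categories[n] for n in names]
    return set.intersection(*sets)

print(intersect("Books published in America",
                "Books published in 1872",
                "Biographies"))
```

This is essentially what tools like CatScan compute, just against the real category tables instead of an in-memory dict.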
If we simply extend MediaWiki to support metadata for works or authors, the metadata is limited to these types and fields. Public metadata can be extended and parsed in any way the local community or our content users feel useful.
Sure, but this is not internal use, so not relevant to my last post.
This is also not possible with database queries, since the metadata is not provided to the software except as part of the wiki text.
It is if you use categories. It would also be possible to hack up some tool to store all template parameter-value pairs, which are strikingly similar to RDF triples: (article, template+parameter name, parameter value).
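A minimal sketch of that idea -- flattening template parameter-value pairs into (article, template+parameter, value) triples. Article, template, and parameter names are invented for illustration:

```python
def params_to_triples(article, template, params):
    """Flatten template parameters into (subject, predicate, object)
    triples, using "template.parameter" as the predicate name."""
    return [(article, f"{template}.{key}", value)
            for key, value in params.items()]

triples = params_to_triples(
    "The Raven",
    "header",
    {"author": "Edgar Allan Poe", "year": "1845"},
)
print(triples)
```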
There is very little difference between internal and external use; it's no easier for a Wikisource editor to find those 1872 American biographies. Editors are also users.
By "internal use" I mean "use by software designed only to work with MediaWiki", not "use by Wikimedia users". Standards are only needed if we want to be useful to software that's also meant to work with other sites. That way, the software can use the same code to process both our site and the other sites, since all output the same standard markup. If the software is only processing MediaWiki sites to begin with, then standard markup is useless. (Unless it happens to expose convenient libraries, like with XML or such -- but that's probably not the case here.)
So, these metadata formats are definitely *not* useless for internal community use.
No, they really are. It's almost certainly more work for us to use a standard of any kind than to make up our own internal format, so if we only care about internal use, bothering with standards is counterproductive. The real use-cases are for external users only.
* Aryeh Gregor Simetrical+wikilist@gmail.com [Sat, 16 Jan 2010 23:06:06 -0500]:
You can do this with database queries fine -- there are already several different toolserver tools that will do category intersections for you, and a couple extensions. In fact, bog-standard search will do it for you, although AFAIK only for categories added literally (not by templates):
http://en.wikipedia.org/w/index.php?title=Special:Search&redirs=1&se...
Intersections are probably inefficient when someone needs a range search between, let's say, 1944 and 1965. SMW probably has the right approach: something sequential and numerical like a date, mass, or speed should not be a Category but a Property. Also, it's a bit sad that so many toolserver tools are standalone and not part of the MediaWiki distribution. That tool should be part of Special:Search.
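The point about range searches is that a numeric query is natural over stored property values but awkward as category intersections (one category per year, intersected 22 times). A toy illustration with invented data:

```python
# Works described by a numeric "year" property (toy data).
works = [
    {"title": "Memoir A", "year": 1944},
    {"title": "Memoir B", "year": 1950},
    {"title": "Memoir C", "year": 1970},
]

def in_range(works, lo, hi):
    """Range query over a numeric property -- awkward to express as a
    stack of per-year categories, trivial over stored values."""
    return [w["title"] for w in works if lo <= w["year"] <= hi]

print(in_range(works, 1944, 1965))
```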
It wouldn't be that hard to allow template-added categories too. I assume you have categories like "books published in America", "books published in 1872", and "biographies" -- if not, you can easily add them via your templates (although that wouldn't work right now with standard search AFAIK, it would work with things like CatScan).
When it comes to subcategories, I always wondered why they have to include the name of the parent category: http://en.wikipedia.org/wiki/Category:Books The word "Books" is repeated many times through the nested categories, although we already know these are the "Books". However, this brings up the problem of "de-parenting" categories, which is hard to resolve because the categories are part of the source text. Perhaps each category could have a full name and a shorter subcategory alias, defined on its NS_CATEGORY page. Dmitriy
Jesse (Pathoschild) <pathoschild <at> gmail.com> writes:
If we simply extend MediaWiki to support metadata for works or authors, the metadata is limited to these types and fields. Public metadata can be extended and parsed in any way the local community or our content users feel useful. Users can add their own metadata (translators? publishers? work licenses?) to templates, and add their own tools and databases to the collection.
Hi Jesse,
the use you may need seems to be a lot like what Semantic MediaWiki is offering. I don't know if Wikisource would consider it, but adding user-curated metadata using a user-generated vocabulary, and being able to query it internally (as well as exporting it externally) is pretty much what we do.
If you have any questions on it, feel free to contact me.
Cheers, denny
On Sun, Jan 17, 2010 at 9:20 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
the use you may need seems to be a lot like what Semantic MediaWiki is offering. I don't know if Wikisource would consider it, but adding user-curated metadata using a user-generated vocabulary, and being able to query it internally (as well as exporting it externally) is pretty much what we do.
The major problem with SMW in the past has been, AFAIK, that it's an enormous amount of code written totally separately from MediaWiki by different people, and would need to be reviewed in its entirety by someone like Tim Starling before it could be enabled on any Wikimedia site. I recall Tim looking briefly at the code and taking a few minutes to find an XSS exploit. There are also likely to be major performance issues scaling to Wikipedia (correct me if I'm wrong). So I wouldn't bet on any progress here anytime soon, especially since we're way behind on reviewing even existing core code, let alone large new extensions.
A much more probable method of progress would be to try committing more modest features incrementally to core, or to small special-purpose extensions. I don't think it would be very hard at all to have the API output a machine-readable summary of the template parameters used on a given page. I might do that today as a proof-of-concept. If I do, then someone familiar with RDF and PHP could probably write a fairly simple patch to turn this code into RDF output. From there it would be pretty simple to write a maintenance script to output RDF for the template parameters on all pages on a wiki, and we could see about incorporating that into the regular Wikipedia data dump.
Notably, this doesn't try to actually use the data on the wiki, so should have no scalability issues. It should also be small enough to put in core with no problems, so all MW wikis could be outputting RDF for their template parameters out of the box. My understanding is that it's expected that data providers may output RDF in whatever format is convenient to them, and someone will have to write OWL to turn this into more conventional formats. But we can output the raw data reasonably easily, at least.
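A sketch of what such a raw export might look like: template parameters serialized as N-Triples, one line per (page, property, value). The property base URI is a made-up placeholder; a real exporter would need an agreed vocabulary:

```python
def to_ntriples(page_uri, params, prop_base):
    """Emit template parameters as N-Triples lines, escaping the
    literal values minimally (backslashes and double quotes)."""
    lines = []
    for key, value in params.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{page_uri}> <{prop_base}{key}> "{escaped}" .')
    return "\n".join(lines)

nt = to_ntriples(
    "http://en.wikisource.org/wiki/The_Raven",
    {"author": "Edgar Allan Poe", "year": "1845"},
    "http://example.org/template/header#",
)
print(nt)
```

An external consumer could load output like this into any triple store and map the placeholder properties onto standard vocabularies afterwards.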
On Jan 17, 2010, at 16:11, Aryeh Gregor wrote:
On Sun, Jan 17, 2010 at 9:20 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
the use you may need seems to be a lot like what Semantic MediaWiki is offering. I don't know if Wikisource would consider it, but adding user-curated metadata using a user-generated vocabulary, and being able to query it internally (as well as exporting it externally) is pretty much what we do.
The major problem with SMW in the past has been, AFAIK, that it's an enormous amount of code written totally separately from MediaWiki by different people, and would need to be reviewed in its entirety by someone like Tim Starling before it could be enabled on any Wikimedia site. I recall Tim looking briefly at the code and taking a few minutes to find an XSS exploit. There are also likely to be major performance issues scaling to Wikipedia (correct me if I'm wrong). So I wouldn't bet on any progress here anytime soon, especially since we're way behind on reviewing even existing core code, let alone large new extensions.
I was not talking about Wikipedia -- even though our scalability tests suggest that it could work there, it is hard to say in advance without testing on the actual WMF server farm. I am merely talking about Wikisource, and wondering if it could be used to solve the problems they have, right now.
Furthermore, the code has had some peer review by now, it is used by sites like Wikia. Our code is getting smaller and we are incorporating comments. It would be great to get further reviews.
So, as said, I am only talking about Wikisource. I think it could be a viable solution for them.
Notably, this doesn't try to actually use the data on the wiki, so should have no scalability issues. It should also be small enough to put in core with no problems, so all MW wikis could be outputting RDF for their template parameters out of the box. My understanding is that it's expected that data providers may output RDF in whatever format is convenient to them, and someone will have to write OWL to turn this into more conventional formats. But we can output the raw data reasonably easily, at least.
Since for the requirements of Wikisource it seems it would be helpful for the wiki itself to store and use the data (e.g. give me all the chapters, in order, of that book written by X between 1920 and 1940), I was wondering if an extension that does that could be helpful. It is obviously entirely possible to have the metadata generated by the RDFa extension, harvested by an external tool, the queries processed by an external tool, and the result uploaded to the wiki. It may be a bit easier for Wikisource if the wiki did it, since it could potentially enable more users to perform these tasks.
In the case of Wikisource I'd further suggest switching off the additional annotation syntax of SMW and going for a mode where the templates do the whole annotation, but that again is an implementation detail that has to be decided by the Wikisource community.
Cheers, denny
On Sun, Jan 17, 2010 at 11:32 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
I was not talking about Wikipedia -- even though our scalability tests suggest that it could work there, it is hard to say in advance without testing on the actual WMF server farm. I am merely talking about Wikisource, and wondering if it could be used to solve the problems they have, right now.
The code still must undergo security review to be enabled on any Wikimedia site. As I said, we don't even have enough reviewers right now to review core code, let alone large new extensions, so it's really not likely in the near future. Even small extensions would probably have a hard time getting enabled right now.
On Sun, Jan 17, 2010 at 11:40 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
Intersections are probably inefficient when someone needs a range search between, let's say, 1944 and 1965. SMW probably has the right approach: something sequential and numerical like a date, mass, or speed should not be a Category but a Property.
Yes, that would be awkward to phrase in Lucene search. The point is, anyway, that enabling something like SMW (probably with fewer features) is orthogonal to RDFa/microdata/RDF support -- the extension could incidentally output RDF or whatnot, but it doesn't matter for internal use.
Also, it's a bit sad that so many toolserver tools are standalone and not part of the MediaWiki distribution. That tool should be part of Special:Search.
Most toolserver tool authors just don't bother applying for commit access for whatever reason. Most tools also either perform badly and/or would need to be rewritten to meet coding standards. Toolserver roots routinely have to kill processes for using up unreasonable amounts of resources.
When it comes to subcategories, I always wondered why they have to include the name of the parent category: http://en.wikipedia.org/wiki/Category:Books The word "Books" is repeated many times through the nested categories, although we already know these are the "Books".
Because categories in MediaWiki form a directed graph, not a tree. Categories don't have a unique parent. Whether this is good or bad is debatable.
Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
[...]
Also, it's a bit sad that so many toolserver tools are standalone and not part of the MediaWiki distribution. That tool should be part of Special:Search.
Most toolserver tool authors just don't bother applying for commit access for whatever reason. Most tools also either perform badly and/or would need to be rewritten to meet coding standards. Toolserver roots routinely have to kill processes for using up unreasonable amounts of resources. [...]
How many of those tools, as extensions, would stand a chance of not being disabled under $wgMiserMode? If such a small feature as the namespace filter in Special:Linksearch risks server meltdown (bug #10593), I doubt more complex searches are on the horizon.
Tim
The point is, anyway, that enabling something like SMW (probably with fewer features) is orthogonal to RDFa/microdata/RDF support -- the extension could incidentally output RDF or whatnot, but it doesn't matter for internal use.
Perhaps the right approach for us would be to have "some" syntax for providing this info, and then generate html5 microdata and/or rdfa into the rendered html, write the triples into an SMW backend store, and provide rdf/xml/n3/whatever output via the api.
there are three aspects here: specify, store, output. perhaps we should look at them separately.
-- daniel
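The specify/store/output split above can be sketched with one record and three functions. All names and formats here are illustrative, not actual MediaWiki hooks:

```python
import json

# One record, three concerns: specify (the input), store (triples in
# a backend), output (RDFa in the rendered HTML, plus a machine
# format via the API).
record = {"subject": "The_Raven", "author": "Edgar Allan Poe"}

def store(record):
    # "store": flatten into triples for a backend store
    return [(record["subject"], "author", record["author"])]

def to_rdfa(record):
    # "output", variant 1: embed into the rendered HTML
    return (f'<span about="{record["subject"]}" '
            f'property="author">{record["author"]}</span>')

def to_api(record):
    # "output", variant 2: serve via the API, e.g. as JSON
    return json.dumps(record)

print(store(record))
print(to_rdfa(record))
print(to_api(record))
```

Keeping the three concerns separate means the storage backend and the output serializations can each be swapped without touching the others.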
On 18/01/10 14:46, Daniel Kinzler wrote:
The point is, anyway, that enabling something like SMW (probably with fewer features) is orthogonal to RDFa/microdata/RDF support -- the extension could incidentally output RDF or whatnot, but it doesn't matter for internal use.
Perhaps the right approach for us would be to have "some" syntax for providing this info, and then generate html5 microdata and/or rdfa into the rendered html, write the triples into an SMW backend store, and provide rdf/xml/n3/whatever output via the api.
there are three aspects here: specify, store, output. perhaps we should look at them separately.
-- daniel
I definitely wouldn't recommend a flat triples store as the only storage representation.
Based on past experience with just such a system, while it's formally semantically equivalent to higher-level descriptions, it's definitely much harder to munge, because you have to reverse-engineer all the reification that was needed to flatten the data into triples in order to be able to see the higher-level patterns; it's much easier to just store the higher-level description in the obvious natural way, and generate the triples representation, and any other metadata output needed, from that.
-- Neil
Neil Harris schrieb:
I definitely wouldn't recommend a flat triples store as the only storage representation.
Based on past experience with just such a system, while it's formally semantically equivalent to higher-level descriptions, it's definitely much harder to munge, because you have to reverse-engineer all the reification that was needed to flatten the data into triples in order to be able to see the higher-level patterns; it's much easier to just store the higher-level description in the obvious natural way, and generate the triples representation, and any other metadata output needed, from that.
True if you know the "obvious natural way" in advance and can design a database schema for it. I don't think we can do that. We'll need a generic abstraction for storing structured (meta) data, so it can be used for all the different kinds of data we will get.
On the other hand, I see the problems with triple stores, especially wrt reification. Triples make this very clumsy, and it's something we will need once we want to map infoboxes. We need it because a lot of the statements given in infoboxes are qualified: they have a source, a unit of measurement, an error margin, a point in time or some other meta-statement attached. I don't have a good solution for this right now, but I do think we should consider it.
-- daniel
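To illustrate the reification problem: a qualified statement kept as one structured record, versus the same statement flattened into triples through an intermediate node that every consumer must reassemble. The vocabulary and data are invented:

```python
# A qualified statement -- a value with a unit and a source -- as one
# structured record.
structured = {
    "subject": "Berlin",
    "property": "population",
    "value": "3431700",
    "unit": "inhabitants",
    "source": "census 2008",
}

def flatten(stmt, node_id):
    """Flattening forces every qualifier through an extra node, which
    a consumer must reverse-engineer to recover the original record."""
    return [
        (stmt["subject"], stmt["property"], node_id),
        (node_id, "value", stmt["value"]),
        (node_id, "unit", stmt["unit"]),
        (node_id, "source", stmt["source"]),
    ]

triples = flatten(structured, "_:stmt1")
print(triples)
```

One field lookup in the structured form becomes a two-hop graph traversal in the flattened form, which is the "munging" cost Neil describes.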
Before we get into this thread too deeply, for those that are not familiar with semantic data, RDF, RDFa or why any of this stuff applies to Wikipedia, there are two very short videos that explain the concepts at a high-level (apologies, as they're a bit dated):
Intro to the Semantic Web (6 minutes) http://www.youtube.com/watch?v=OGg8A2zfWKg
RDFa Basics (9 minutes) http://www.youtube.com/watch?v=ldl0m-5zLz4
Aryeh Gregor wrote:
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use.
Not necessarily. Javascript can use the RDFa on the page to generate more intuitive interfaces for the page. To give an example - we use the RDFa expressed in our music pages:
http://bitmunk.com/media/6995806
to drive the music player application via Javascript - by parsing the RDFa and feeding the sample URLs to the player.
To give a less than ideal example - Wikipedia could use data on the page to provide interactive discovery of concepts expressed on the page (such as automatically fetching and parsing RDFa on a related page to display more factual information on the current page). The gist of what I'm getting at is to not dismiss the value of having a standardized mechanism for embedded page data - you get to use it internally and externally. The more data you expose, the greater the possibility of somebody figuring out how to use the data in amazing new ways.
Aryeh Gregor wrote:
I'll emphasize from the start that I do *not* think either RDFa or microdata is suitable for dbpedia.org-style content. There's no reason we should put that in the HTML output, where it will take up tons of space and not be useful to HTML consumers (e.g., browsers and search engines).
Placing this data in your HTML documents has a direct impact on browsers and search engines. Browsers can collect triples and use them later to help you answer questions that you may have about a particular subject. Search engines can crawl the HTML and make their indexes more accurate based on semantic data that Wikipedia's pages expose.
RDF/XML, which was largely unsuccessful, was designed to be used for publishing in a dual-stream setup. It was expected that web publishers would publish semantic data beside web page data, just as you've proposed Wikipedia do, but managing both types of serialization proved far too difficult for most sites.
Wikipedia is already short on developers, creating a new data stream is just going to exacerbate the problem. Besides, the way Wikipedia seems to be capturing data is via wikitext, not direct database entries. In effect, this community's database exists in the wikitext.
Aryeh Gregor wrote:
On the other other other hand, RDFa 1.1 is under development and looks like it will make major changes, so from that perspective microdata is arguably more stable.
There are new features going into RDFa 1.1, but classifying them as "major" changes makes it sound like RDFa 1.1 isn't going to be backwards-compatible with RDFa 1.0, when it most definitely is going to be backwards-compatible (except possibly for XMLLiterals, which was our bad).
The statement that "Microdata" is more stable because there are new features going into RDFa 1.1 is illogical. For example: just because there are new features going into the next version of Apache doesn't mean that it's any less "stable" for those that are using the current version today.
Aryeh Gregor wrote:
So, it's complicated. :) But from our perspective, I don't think there's a big difference in terms of stability or standard-ness, so I skipped over all this.
There's a huge difference in both stability and standard-ness - XHTML+RDFa is a W3C REC - it's a standard. Microdata and HTML+RDFa aren't even close to becoming a W3C REC. That's very important information for this community to consider.
When do you think that Microdata is going to be a REC at the W3C?
There were changes to the Microdata spec made by Ian less than 12 hours ago (January 18th 2010). If a spec is being actively edited, I don't think it's a good idea to say that it's stable and ready for deployment:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-January/024760.html
You are skipping over some pretty important stuff, kemosabe. :)
Aryeh Gregor wrote:
so converting the microdata graph to RDFa might be easier than the reverse.
Microdata's underlying model is triples as well - Microdata allows the limited expression of RDF. Since RDFa supports the expression of RDF more formally, you can map Microdata to RDFa more easily than you can map RDFa to Microdata (for some value of "easier").
You cannot, however, express RDF fully in Microdata - it is impossible in cases where it matters to Wikipedia (like data-typing).
Microdata doesn't support data typing (via @datatype), doesn't support data value overriding (via @content), doesn't support URI short-handing via CURIEs (via @xmlns:PREFIX), and doesn't support anonymous subjects via bnodes (blank nodes). The @datatype, @content and CURIE omissions affect Wikipedia; not supporting bnodes doesn't necessarily impact the Wikipedia community, AFAICT.
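For readers unfamiliar with CURIEs: they are compact URIs expanded against a prefix map, which is the URL shorthand Microdata lacks. A minimal sketch; the prefix URIs are the ones commonly used for these vocabularies:

```python
# A CURIE like "dc:title" expands against a prefix map to a full URI.
prefixes = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand_curie(curie, prefixes):
    prefix, sep, reference = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return curie  # no known prefix; leave as-is

print(expand_curie("dc:title", prefixes))
print(expand_curie("foaf:name", prefixes))
```

Without this mechanism, every property in the page must carry its full vocabulary URL, which is the repetition complaint raised later in the thread.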
Aryeh Gregor wrote:
I also think microdata is much easier to author for people with an HTML (not RDF) background -- template editors tend to have a good working knowledge of HTML, but not web-data technologies. I'd be interested in what Manu (or other RDFa supporters) has to say here.
I do think that Microdata has that going for it - in that property names such as @itemref, @itemprop, etc. are easier to understand than @about, @datatype, @rel/@rev, and @content.
I'm all for making it easier for web authors to write this stuff, so the consistency of the itemXYZ attributes in Microdata was a good move. We didn't choose to do that for RDFa because we wanted to make the mapping from HTML to RDF explicit. The down-side with that is it requires authors to either have their RDFa autogenerated for them (which is the best thing for RDFa and Microdata), or it requires them to sit through a 10 minute tutorial on RDF (like the video at the top of this e-mail).
I do also think that Microdata has made several really big mistakes that we made in the Microformats community that were corrected in the RDFa community. Namely, not using CURIEs and adding the requirement that all URLs are repeated as many times as they're used. It's fine as an option, but not that great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it is using Microdata.
http://rdfa.info/wiki/Developer-faq#Authoring
The FAQ above, which is a work in progress, is a good introduction to some of the common criticisms against RDFa and the reasoning behind the design decisions, for those that are interested.
The FAQ also addresses the fallacy that RDFa markup is, for real-world data, more verbose than Microdata markup.
Aryeh Gregor wrote:
Neither has more built-in validation than the other. Both allow arbitrary validation. RDFa seems to allow validation to be encoded in a more machine-readable format, but whether that's an advantage at all is debatable.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. In order to validate Microdata, you must first convert it to RDF, and even if you do, it will fail attempts to validate the literals that should have a datatype. If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
-- manu
[1]http://krijnhoetmer.nl/irc-logs/whatwg/20100118#l-219
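The generic, data-driven validation being argued for here can be sketched as a single validator that takes datatype rules as data, rather than one hard-coded validator per vocabulary. The regexes are deliberately simplified; real XSD datatype checking is stricter:

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Datatype rules supplied as data: one validator covers every
# vocabulary that reuses these datatypes. (Simplified regexes; real
# XSD datatype checking is stricter.)
VALIDATORS = {
    XSD + "integer": re.compile(r"^[+-]?\d+$"),
    XSD + "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def literal_is_valid(value, datatype):
    pattern = VALIDATORS.get(datatype)
    if pattern is None:
        return True  # unknown datatype: nothing to check
    return bool(pattern.match(value))

print(literal_is_valid("1872", XSD + "integer"))        # True
print(literal_is_valid("about 1872", XSD + "integer"))  # False
print(literal_is_valid("1872-01-15", XSD + "date"))     # True
```

Adding a new vocabulary here means adding data (more datatype entries), not writing a new validator, which is the contrast with the per-vocabulary approach.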
On Mon, Jan 18, 2010 at 5:34 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Not necessarily. Javascript can use the RDFa on the page to generate more intuitive interfaces for the page.
Sure, but if we're providing the JavaScript, we could do it without RDFa just as well. Or can you provide a specific case where you think it would be easier for MediaWiki to implement some feature via RDFa (or microdata) than via any other means, not counting communication with outside software? Such cases might exist (like if there's a library to do it that already happens to use RDFa), but they'd be hard to find and debatable at best, I suspect.
Placing this data in your HTML documents has a direct impact on browsers and search engines. Browsers can collect triples and use them later to help you answer questions that you may have about a particular subject. Search engines can crawl the HTML and make their indexes more accurate based on semantic data that Wikipedia's pages expose.
*Can*. Yes, in theory. But do they? Will they? If not, then it's probably not worth the effort to put much work into it so speculatively, especially if it increases the complexity of editing. On the other hand, if they do implement feature X if you provide in-page metadata, would they be equally willing to use a separate RDF stream?
RDF/XML, which was largely unsuccessful, was designed to be used for publishing in a dual-stream setup. It was expected that web publishers would publish semantic data beside web page data, just as you've proposed Wikipedia do, but managing both types of serialization proved far too difficult for most sites.
Is it managing two serializations that was the problem? Or just that most sites aren't willing to encode data in the hope that some consumer somewhere might use it for something in the future? Personally, I don't think it would be hard at all to maintain multiple data streams. The content is all script-generated anyway. We already have multiple ways to access the same data or subsets thereof in various formats, like:
http://en.wikipedia.org/wiki/RDFa http://en.wikipedia.org/wiki/RDFa?action=raw http://en.wikipedia.org/w/api.php?action=query&prop=categories&title... http://en.wikipedia.org/w/api.php?action=query&prop=extlinks&titles=... http://en.wikipedia.org/w/api.php?action=query&prop=templates&titles...
and many others. You can append &format=xml to the API queries to get them in proper XML, or &format=json for JSON, php for PHP array syntax, yaml for YAML, txt for plaintext, etc. It would be pretty simple to write a new API module or query prop or whatever that would retrieve any type of data from the wikitext of the page and format it as RDF or whatever else you liked.
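To make that concrete, here is a minimal sketch of what such an export could look like. It converts a sample of the JSON that prop=categories returns into Turtle triples; in practice the JSON would come from an HTTP request to api.php, and the choice of Dublin Core's dc:subject as the mapping is purely illustrative, not something the API defines.

```python
import json

# A sample of the JSON that api.php?action=query&prop=categories&format=json
# returns; in a real module this would come from an HTTP request to the API.
api_response = json.loads("""
{"query": {"pages": {"25458": {"title": "RDFa",
    "categories": [{"title": "Category:Semantic Web"},
                   {"title": "Category:World Wide Web Consortium standards"}]}}}}
""")

def to_turtle(response):
    """Emit one dc:subject triple per category, in Turtle syntax.
    The Dublin Core mapping here is illustrative, not prescriptive."""
    lines = ["@prefix dc: <http://purl.org/dc/terms/> ."]
    for page in response["query"]["pages"].values():
        subject = "<http://en.wikipedia.org/wiki/%s>" % page["title"]
        for cat in page.get("categories", []):
            # Strip the "Category:" namespace prefix from the title.
            name = cat["title"].split(":", 1)[1]
            lines.append('%s dc:subject "%s" .' % (subject, name))
    return "\n".join(lines)

print(to_turtle(api_response))
```

The point is that the transformation is a few lines of glue over data the API already serves, not a separate data stream to maintain.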
Wikipedia is already short on developers; creating a new data stream would only exacerbate the problem.
No, it would be pretty simple, in my opinion as a MediaWiki developer.
There are new features going into RDFa 1.1, but classifying them as "major" changes makes it sound like RDFa 1.1 isn't going to be backwards-compatible with RDFa 1.0, when it most definitely is going to be backwards-compatible (except possibly for XMLLiterals, which was our bad).
I apologize if I inadvertently misrepresented the status of RDFa 1.1. I'm not familiar with RDFa, as I said.
There's a huge difference in both stability and standard-ness - XHTML+RDFa is a W3C REC - it's a standard. Microdata and HTML+RDFa aren't even close to becoming a W3C REC. That's very important information for this community to consider.
When do you think that Microdata is going to be a REC at the W3C?
I don't really care about formal status at the W3C. I care about providing useful features to users of Wikipedia and other MediaWiki wikis. Both RDFa and microdata are stable and usable enough right now that I think it's appropriate to evaluate them on their technical merits, not their theoretical spec status. We use plenty of things that aren't specified by any conventional standards body, like rel="canonical", OpenSearch, RSS, and so on. As long as they're well-specified de facto standards, it doesn't really matter who specifies them or what that group labels them -- why should it?
There were changes to the Microdata spec made by Ian less than 12 hours ago (January 18th 2010). If a spec is being actively edited, I don't think it's a good idea to say that it's stable and ready for deployment:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-January/024760.html
I don't see why not, as long as the editor is committed to avoiding backward-incompatible changes if possible. In the unlikely event something major comes up and there is such a change, it's not the end of the world -- we can deal with it when it comes up.
Microdata doesn't support data typing (via @datatype),
More precisely, it leaves it up to each vocabulary to determine how to handle data typing.
data value overriding (via @content),
<meta itemprop="foo" content="bar">?
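For what it's worth, microdata's meta/@content mechanism is also straightforward to consume. A rough sketch with Python's standard html.parser, where the property names ("title", "duration") are made up for illustration:

```python
from html.parser import HTMLParser

class ItemPropCollector(HTMLParser):
    """Collect itemprop name/value pairs, honouring <meta content=...>
    as the machine-readable override of the displayed text."""
    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" not in attrs:
            return
        if tag == "meta":
            # The @content value is the property value; nothing is displayed.
            self.props[attrs["itemprop"]] = attrs.get("content", "")
        else:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current:
            self.props[self._current] = data.strip()
            self._current = None

# "duration" carries a machine-readable ISO 8601 value in @content while
# the visible text stays human-readable (property names are hypothetical).
html = ('<span itemprop="title">Cool Song</span>'
        '<meta itemprop="duration" content="PT4M13S">')
p = ItemPropCollector()
p.feed(html)
print(p.props)
```

So the same "displayed value differs from machine value" pattern that RDFa's @content covers is available in microdata, just spelled differently.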
doesn't support URI short-handing via CURIEs (via @xmlns:PREFIX),
It doesn't require URIs to be used for anything except one itemtype per item, so this isn't a big deal if you only have a few items of any given type per page (which would usually be the case for, e.g., image licenses).
and it doesn't support anonymous subjects via bnodes (blank nodes).
I'm not sure what this even means. :)
I do also think that Microdata has made several really big mistakes that we made in the Microformats community that were corrected in the RDFa community. Namely, not using CURIEs and adding the requirement that all URLs are repeated as many times as they're used. It's fine as an option, but not that great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it is using Microdata.
Not if we only use it for a few things, like image licenses. Those are only displayed on the image description page, so it would be once per page in that case. I don't propose we use it for anything where we'd have fifty items per page.
RDFa seems longer even if you don't count the xmlns: stuff, anyway. Above, I found that a microdata example added 145 characters to the base markup, while equivalent RDFa (with xmlns:) added 305 characters. If you remove the two xmlns: declarations, I count only 86 characters saved, so RDFa still adds 219 characters, 50% more than microdata. So at best, dropping the xmlns: declarations saves RDFa some space, but microdata is still significantly shorter, at least for this example.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. In order to validate Microdata, you must first convert it to RDF, and even if you do, it will fail attempts to validate the literals that should have a datatype. If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
I think you agreed with what I said. Both microdata and RDFa allow validation. RDFa allows some validation constraints to be expressed in a standard form, so they can be checked by generic RDFa validators. Microdata does not.
But it's not clear to me that this is a disadvantage in practice. Presumably anything that actually uses the data will necessarily be smart enough anyway to discard invalid data at no extra cost, so why not just do it at that stage? Or, if you're using a very small set of vocabularies as I propose MediaWiki does, you can assume that validators will exist for them anyway.
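To illustrate that point about validating at the consuming stage: a consumer that only cares about a handful of properties can check each value as it ingests the triples, regardless of whether they came from microdata or RDFa. A rough sketch under that assumption; the property names and validators here are invented for illustration:

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical per-property validators; a real consumer would have one
# for each property it actually uses, whatever the source syntax was.
def is_url(value):
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def is_iso_date(value):
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

VALIDATORS = {"license": is_url, "date": is_iso_date}

def ingest(triples):
    """Keep only (subject, property, value) triples whose value passes
    that property's validator; unknown properties are kept as-is."""
    return [t for t in triples
            if VALIDATORS.get(t[1], lambda v: True)(t[2])]

triples = [
    ("Image:Foo.jpg", "license", "http://creativecommons.org/licenses/by-sa/3.0/"),
    ("Image:Foo.jpg", "date", "not-a-date"),
    ("Image:Foo.jpg", "date", "2010-01-18"),
]
print(ingest(triples))  # the "not-a-date" triple is dropped
```

The validation logic lives with the consumer either way; the only question is whether the constraints are also declared in a standard, machine-readable form.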
On Mon, Jan 18, 2010 at 23:34, Manu Sporny msporny@digitalbazaar.com wrote:
You cannot, however, express RDF fully in Microdata - it is impossible in cases where it matters to Wikipedia (like data-typing).
I'm not a Wikipedia developer or particularly active editor, but it sounds quite doubtful that XML Schema Datatypes matters to Wikipedia. Perhaps I haven't understood RDFa, but surely the vocabulary must define the datatype? If not, is @datatype a mandatory attribute that just adds dead weight all over the place? And if vocabularies do define the datatypes, why do you need to override them?
I do also think that Microdata has made several really big mistakes that we made in the Microformats community that were corrected in the RDFa community. Namely, not using CURIEs and adding the requirement that all URLs are repeated as many times as they're used. It's fine as an option, but not that great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it is using Microdata.
There are other solutions to the "URLs are long" problem than prefix schemes. Incidentally http://n.whatwg.org/work is rather short, and I hope future vocabularies will have the good taste to use even shorter URLs.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. In order to validate Microdata, you must first convert it to RDF, and even if you do, it will fail attempts to validate the literals that should have a datatype.
Is the only kind of validation that RDF provides a check that something is the same kind of data it claims to be? That sounds similar to, and about as unhelpful as, doctypes. What if the author doesn't set the datatype?
If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
What are the exact mechanisms here? Does an RDFa validator dereference all predicates and try to get an RDF Schema to validate against? Doesn't that overwhelm any web server which hosts schemas for popular vocabularies (like with W3C doctypes)? On the other hand, if only the document itself is used, what kind of validation can be meaningful?
In any case, validators for microdata is something to be worked on, but I don't think either dereferencing vocabulary URLs or an official schema language is likely to be part of the solution (the latter because you need a full programming language to validate certain types of data, not just grammar rules).
wikitech-l@lists.wikimedia.org