Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
> This is about as long as before, but it might still be wrong. The general points I made are still accurate, anyway.
The general points that you made were riddled with technical inaccuracies and bad advice and, if implemented by the MediaWiki community, would have resulted in semantic data that was ambiguous at best and erroneous at worst. I don't know if you intended the tone of your e-mail to come across the way that I read it, but it came off as purposefully misleading, given the discussions that you and I have both had as members of the HTMLWG and WHATWG. I'll address the technical and factual errors that I believe were made in your posts and provide alternative guidance.
Just to briefly introduce myself to this community: I do standards work in a variety of online communities, including the Microformats community (lead editor for hAudio, hMedia and hVideo), and I contract my expertise to the music industry. I am also an Invited Expert to the W3C's Semantic Web Deployment Working Group, co-chair of the upcoming RDFa Working Group and editor of the HTML5+RDFa spec. The company I founded is interested in expressing digital content online via semantic languages and builds open source software for the creation and standardization of copyright-aware, DRM-free, peer-to-peer networks.
For guidance on how to implement semantic markup in a CMS, we might want to look at the Drupal Community, who have done a superb job of integrating RDFa into their platform. They expect several hundred thousand websites to start using RDFa within the next year or two.
One lesson that we learned during the implementation of RDFa in Drupal is that it helps if CMS designers pre-define the vocabularies that are usable with their CMS, in case manual markup is necessary. Most markup of both Microdata and RDFa should also be left to the CMS code unless there is a very good reason not to do so.
If you want to allow manual markup of RDFa, MediaWiki should probably pre-define at least Dublin Core (used to describe creative works), FOAF (used to describe people and organizations), and Creative Commons (used to describe licenses). There are many RDF vocabularies to choose from, and Wikipedia might consider creating a few of its own. Pre-defining vocabularies would greatly simplify the markup for anyone who wants to mark something up by hand.
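To make the idea of pre-defined vocabularies concrete, here is a minimal sketch of a prefix table and a CURIE expander. It is not MediaWiki or Drupal code; the names PREDEFINED_PREFIXES and expand_curie are made up for illustration, and the choice of which prefixes to ship is an assumption. The namespace URIs are the published ones for each vocabulary.

```python
# Hypothetical prefix table a CMS could ship by default. The selection mirrors
# the vocabularies named above, plus the two extra prefixes (dctype, xhv) that
# appear in the example later in this message.
PREDEFINED_PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "dctype": "http://purl.org/dc/dcmitype/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "cc": "http://creativecommons.org/ns#",
    "xhv": "http://www.w3.org/1999/xhtml/vocab#",
}


def expand_curie(curie, prefixes=PREDEFINED_PREFIXES):
    """Expand a CURIE such as 'dc:title' into a full URI.

    Raises KeyError for a prefix that has not been pre-defined, so a typo in
    hand-written markup is caught instead of silently producing bad data.
    """
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefixes[prefix] + reference


if __name__ == "__main__":
    print(expand_curie("dc:title"))
    print(expand_curie("cc:attributionName"))
```

With a table like this in place, an editor only has to remember short names such as dc:title, and the CMS can reject anything outside the known set.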
Let's revisit Aryeh's example:
Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
The above could be marked up in RDFa, with pre-defined vocabs, like so:
    <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage">
      <span property="dc:title">Emery Molyneux Terrestrial Globe</span>
      by <a rel="cc:attributionUrl" href="http://example.org/bob/"
            property="cc:attributionName">Bob Smith</a>
      is licensed under a
      <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.
    </p>
This would produce the following triples (I haven't expanded the CURIEs, in order to make it easier to read):
    <EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> rdf:type dctype:StillImage .
    <EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> dc:title "Emery Molyneux Terrestrial Globe" .
    <EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> cc:attributionUrl <http://example.org/bob/> .
    <EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> cc:attributionName "Bob Smith" .
    <EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> xhv:license <http://creativecommons.org/licenses/by-sa/3.0/us/> .
So, five pieces of data (the rel="cc:attributionUrl" link contributes one as well), which is pretty good considering the compactness of the HTML code. The Microdata looks like this:
    <div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
      ...
      <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
           width="640" height="480" itemprop="work">
      ...
      <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span>
        by <span itemprop="author">Bob Smith</span>
        is licensed under a
        <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
    </div>
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example. There are some things that are easier to express in Microdata and there are some things that are easier to express in RDFa. We get the following Microdata out:
    type    http://n.whatwg.org/work
    work    http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg
    title   "Emery Molyneux Terrestrial Globe"
    author  "Bob Smith"
    license http://creativecommons.org/licenses/by-sa/3.0/us/
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
Concern #1:
Ambiguity is a big problem when it comes to semantics - if this community does use Microdata markup, make sure that you fully qualify terms. It is far easier to be ambiguous in Microdata than it is in RDFa. So, instead of using itemprop="title" you should be using itemprop="http://purl.org/dc/terms/title" - which will inflate the markup required for Microdata, but is necessary for classifying this information accurately for semantic data processors (such as SPARQL engines or higher-level reasoning agents).
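As a rough illustration of what fully qualifying terms could look like on the CMS side, the sketch below maps the short names from the Microdata example to full property URIs. The table QUALIFIED_ITEMPROPS and the function qualify_itemprop are hypothetical, and the URI picked for "author" is an assumption chosen to line up with the RDFa example; a real deployment would need to agree on its own mapping.

```python
# Hypothetical mapping from the short names used in the Microdata example to
# fully qualified property URIs. Which URI each short name should map to is
# an assumption made for illustration.
QUALIFIED_ITEMPROPS = {
    "title": "http://purl.org/dc/terms/title",
    "author": "http://creativecommons.org/ns#attributionName",
    "license": "http://creativecommons.org/ns#license",
}


def qualify_itemprop(name):
    """Return the full URI for a short itemprop name.

    Names that already look like absolute URIs pass through unchanged;
    unknown short names raise KeyError rather than staying ambiguous.
    """
    if name.startswith(("http://", "https://")):
        return name
    return QUALIFIED_ITEMPROPS[name]


if __name__ == "__main__":
    for short_name in ("title", "author", "license"):
        print(short_name, "->", qualify_itemprop(short_name))
```

Failing loudly on an unknown short name is a deliberate choice here: an ambiguous "title" that slips through is harder to clean up later than an error at edit time.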
Concern #2:
Getting Microdata and RDFa markup correct is easier if there are templates, or if the semantic markup is generated automatically by the CMS from a pre-defined form. For example, see the Infobox on the right of http://en.wikipedia.org/wiki/Augustus. It would be much better for the RDFa markup to be generated automatically via MediaWiki's template process than for it to be marked up by hand.
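To show what "let the template do it" might mean in practice, here is a small sketch that renders an attribution paragraph from structured fields, using the same attributes as the RDFa example earlier in this message. It is plain Python rather than MediaWiki template syntax, and render_attribution and its parameters are invented for illustration.

```python
from html import escape


def render_attribution(file_name, title, author_name, author_url,
                       license_url, license_name):
    """Render an RDFa-annotated attribution paragraph from structured fields.

    Every value is HTML-escaped, so editors only supply data, never markup.
    The dc:, dctype: and cc: prefixes are assumed to be pre-defined by the CMS.
    """
    def e(value):
        return escape(value, quote=True)

    return (
        f'<p about="{e(file_name)}" typeof="dctype:StillImage">'
        f'<span property="dc:title">{e(title)}</span> by '
        f'<a rel="cc:attributionUrl" href="{e(author_url)}" '
        f'property="cc:attributionName">{e(author_name)}</a> '
        f'is licensed under a '
        f'<a rel="license" href="{e(license_url)}">{e(license_name)}</a>.'
        f'</p>'
    )


if __name__ == "__main__":
    print(render_attribution(
        "EmeryMolyneux-terrestrialglobe-1592-20061127.jpg",
        "Emery Molyneux Terrestrial Globe",
        "Bob Smith",
        "http://example.org/bob/",
        "http://creativecommons.org/licenses/by-sa/3.0/us/",
        "Creative Commons Attribution-Share Alike 3.0 United States License",
    ))
```

Because the attribute names live in one place, a mistake in the markup is fixed once in the template instead of on every page that was annotated by hand.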
Concern #3:
Intentional or not, Aryeh has painted RDFa in a negative light by omitting a number of points related to adoption and to the current status of both RDFa and Microdata in the HTML Working Group. Adopting either RDFa or Microdata in an HTML5 document type would be premature at this time because neither has progressed past the Editor's Draft stage yet. Both are subject to change as far as HTML5 is concerned, and we really don't want you to ship HTML5 features before they've had a chance to solidify a bit more.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use it for deployment. Google[1] is actively indexing RDFa today, as is Yahoo[2]. Digg, Whitehouse.gov, the UK Government, The Public Library of Science and O'Reilly are high-profile sites that publish their pages using RDFa. Data formats such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of their language. Best Buy saw a 30% traffic increase after publishing their pages in RDFa using the GoodRelations vocabulary. I'm sure everyone here is aware of dbpedia.org[3] and Freebase[4] - which use RDF as a semantic representation format. dbpedia, which gets its data from Wikipedia, shows 479 million triples available - so that should give you folks some idea of the treasure trove of immediately extractable semantic data we're talking about.
Make no mistake - RDFa has very strong deployment at this point and it will continue to grow past 100,000 sites with the upcoming release of Drupal 7.
Concern #4:
While I can't fault Aryeh's enthusiasm, I am now concerned that there may be questions in this community that are going unanswered related to RDFa and Microdata. I hope this will be a deliberate process as it is easy to get semantic data markup wrong (regardless of the implementation language - Microformats, Microdata or RDFa).
I hope that those who have an interest in semantic data will discuss their concerns and ask us about the lessons we've learned when implementing metadata markup. The best place to send RDFa development questions at the moment is:
http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/
We have a very friendly community that would love to answer any questions that this community may have related to semantic data markup. Please do respond to me directly or in this thread if you have lingering concerns or questions - either the RDFa community or I will do our best to answer any questions.
-- manu
[1] http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets...
[2] http://developer.yahoo.net/blog/archives/2008/09/searchmonkey_support_for_rd...
[3] http://en.wikipedia.org/wiki/DBpedia#Example
[4] http://en.wikipedia.org/wiki/Freebase_(database)