On Sat, Jan 16, 2010 at 12:32 AM, Manu Sporny msporny@digitalbazaar.com wrote:
I don't know if you intended the tone of your e-mail in the way that I read it, but it came off as purposefully misleading based on the discussions that both you and I have had as members of the HTMLWG and WHATWG.
I do not claim to be an expert on RDFa, Microdata, or any similar technology. I'd prefer not to have to make a decision here at all, and I've said so. However, it looks like we (MediaWiki) have good reason to use something or other. For the reasons I gave, I think we should choose whatever we believe is more likely to succeed, and failing that, whatever we think is better (e.g., on grounds of aesthetics or intuitiveness). The example markup I gave might not be ideal or accurate, but it serves to give a general idea of what the markup looks like in each case, at least. Thank you for your better RDFa examples -- although it's telling that I was able to get Microdata right on the first try, but apparently it took an RDFa expert to figure out the correct RDFa.
However, as a Wikimedian, I'd like to point you to one of our core guiding principles:
http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith
One lesson that we learned during implementation of RDFa in Drupal is that it is helpful for CMS designers to pre-define vocabularies that are usable with their CMS systems if manual markup is necessary. Most markup of both Microdata and RDFa should also be left to the CMS code unless there is a very good reason to not do so.
The major use case for us is image licensing on Commons. Currently the license templates are written "by hand", in the sense that they aren't hardcoded in the software, but they're maintained by technically advanced community members, so ordinary users don't see the markup. To use my example image, look at this page:
http://commons.wikimedia.org/wiki/File:EmeryMolyneux-terrestrialglobe-1592-2...
You can see the wikitext source of the page by hitting "view source" (or "edit" if it's unprotected by the time you read this) at the top. The license info is generated by:
{{cc-by-2.0}}
This expands to:
<table cellspacing="8" cellpadding="0" style="width:100%; clear:both; text-align:center; margin:0.5em auto; background-color:#f9f9f9; border:2px solid #e0e0e0; direction: ltr;" class="layouttemplate"> <tr> <td style="width:90px;" rowspan="3"><img alt="w:en:Creative Commons" src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/CC_some_rights_reserved.svg/90px-CC_some_rights_reserved.svg.png" width="90" height="36" /><br /> <img alt="attribution" src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Cc-by_new_white.svg/24px-Cc-by_new_white.svg.png" width="24" height="24" /></td> <td>This file is licensed under the <a href="http://en.wikipedia.org/wiki/en:Creative_Commons" class="extiw" title="w:en:Creative Commons">Creative Commons</a> <a href="http://creativecommons.org/licenses/by/2.0/deed.en" class="external text" rel="nofollow">Attribution 2.0 Generic</a> license.</td> <td style="width:90px;" rowspan="3"></td> </tr> <tr style="text-align:center;"> <td></td> </tr> <tr style="text-align:left;"> <td> <dl> <dd>You are free: <ul> <li><b>to share</b> – to copy, distribute and transmit the work</li> <li><b>to remix</b> – to adapt the work</li> </ul> </dd> <dd>Under the following conditions: <ul> <li><b>attribution</b> – You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).</li> </ul> </dd> </dl> </td> </tr> </table>
(Not cutting-edge markup, but oh well.) This is generated by the contents of http://commons.wikimedia.org/wiki/Template:Cc-by-2.0, which was created by the Commons community. Pretty much all boilerplate on Wikimedia projects is created by such templates. So when we enable Microdata and/or RDFa in MediaWiki wikitext, I'd expect it to be used almost exclusively in templates, with few users actually being directly exposed to it. Since the content of MediaWiki pages has no structure other than wikitext, basically we have to allow this in wikitext to make it useful to mark up content.
I'll emphasize from the start that I do *not* think either RDFa or microdata is suitable for dbpedia.org-style content. There's no reason we should put that in the HTML output, where it will take up tons of space and not be useful to HTML consumers (e.g., browsers and search engines). That sort of data should be made available in a separate stream for consumers who want it, in a dedicated format like RDF. That way HTML consumers aren't forced to download loads of useless metadata, and metadata consumers aren't forced to download loads of useless (and expensive-to-generate) HTML. RDFa/Microdata should *only* be used for metadata that's useful to HTML consumers of some kind.
If you want to allow manual markup of RDFa, MediaWiki should probably pre-define at least Dublin Core (used to describe creative works), FOAF (used to describe people and organizations), and Creative Commons (used to describe licenses).
I expect that we'd allow contributors to use whatever vocabularies they'd like. It's a wiki, after all. :)
The above could be marked up in RDFa, with pre-defined vocabs, like so:
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage"> <span property="dc:title">Emery Molyneux Terrestrial Globe</span> by <a rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
. . .
So, four pieces of data, which is pretty good considering the compactness of the HTML code. The Microdata looks like this:
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work"> ... <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div>
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example.
You're comparing apples to oranges here: you included the div and img for Microdata but not RDFa. If you include that for RDFa, and also count the xmlns:, it becomes (correct me if I'm wrong)
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage"><span xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">Emery Molyneux Terrestrial Globe</span> by <a xmlns:cc="http://creativecommons.org/ns#" rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
You do have to count the xmlns: somewhere. Even if you put them on the <html>, they still count at least once, and in this case they're only used once on the page, so they deserve to count in full. This is 685 characters. On the other hand, Microdata:
[[ <div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work"> ... <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div> ]]
525 characters. Compare to the original with no extra semantics:
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p>Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a <a href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div> ]]
380 characters. So Microdata adds 145 bytes, while RDFa adds 305: 2.1 times as much extra markup. To be fair, you included an extra link to http://example.org/bob/ which wasn't in the original example, but RDFa is still about twice as many bytes.
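The overhead arithmetic can be rechecked with a few lines of Python. This is only a sketch that takes the three character counts quoted above as given; it does not recount the markup itself, and the variable names are illustrative.

```python
# Character counts as reported in the message above (taken as given).
baseline = 380    # plain HTML, no extra semantics
microdata = 525   # same markup with Microdata attributes
rdfa = 685        # same markup with RDFa attributes and xmlns: declarations

microdata_overhead = microdata - baseline
rdfa_overhead = rdfa - baseline
ratio = rdfa_overhead / microdata_overhead

print(microdata_overhead, rdfa_overhead, round(ratio, 1))  # 145 305 2.1
```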
It's not just bytes, though. It's also complexity. The Microdata is *obvious*. I've never used Microdata before in my life, or RDFa, but somehow I got the Microdata right on the first try, while making several errors in the RDFa. It's not at all obvious what those xmlns: things do, or what those cryptic prefixes mean. Microdata is simpler to understand at first glance for people from an HTML background. Since you've been working with RDF for years, the magnitude of the difference is probably not apparent to you.
Getting Microdata and RDFa markup correct is easier if there are templates or if the semantic markup is performed automatically by the CMS based on a pre-defined form. For example, at http://en.wikipedia.org/wiki/Augustus, note the infobox on the right. It would be much better for the RDFa markup to happen automatically via MediaWiki's template process than for it to be marked up by hand.
As I noted, the templates are made by hand, by each community. The software just gives the ability to include one page in another with simple substitutions made. The infobox on the Augustus article is http://en.wikipedia.org/wiki/Template:Infobox_royalty, invoked like so:
{{Infobox royalty
| name = Caesar Augustus
| title = [[Roman Emperor|Emperor]] of the [[Roman Empire]]
. . . snip 18 lines . . .
| place of death = [[Nola]], [[Italia (Roman Empire)|Italia]], [[Roman Empire]]
| place of burial = [[Mausoleum of Augustus]], Rome
|}}
The template authors would be the ones to add semantics here, not the software developers. There are a couple orders of magnitude more wiki editors than software developers, so it just wouldn't be practical for the developers to be the ones to assign semantic markup to each and every template. Moreover, as you can tell from the HTML output of the templates, template editors tend to be of the "copy-paste stuff until it works" school of HTML authorship. So you cannot argue here that RDFa is just as good if we abstract away the actual markup. We aren't in a position to do that -- users with little to no knowledge of RDFa or microdata will be editing the raw markup, and that has to be taken into account.
Intentional or not, Aryeh has painted RDFa in a negative light by not outlining a number of points related to adoption and to the current status of both RDFa and Microdata in the HTML Working Group. Adopting either RDFa or Microdata in an HTML5 document type would be premature at this time because neither has progressed past the Editor's Draft stage yet. Either is subject to change as far as HTML5 is concerned and we really don't want you to ship HTML5 features before they've had a chance to solidify a bit more.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use it for deployment.
Microdata is also safe to use for deployment. Like other web technologies maintained by the WHATWG, it will not change once it's widely adopted, and Wikipedia adoption would probably count as wide adoption by itself. Note that microdata, like all of HTML5, is at Last Call at the WHATWG, independent of its status as Working Draft in the W3C.
I've asked Hixie how stable Microdata is. Since he's the sole person who decides on changes to HTML5 at the WHATWG, as you know, his answer should be authoritative.
Google[1] is actively indexing RDFa today as is Yahoo[2]. High-profile sites such as Digg, Whitehouse.gov, The Public Library of Science, O'Reilly and the UK Government publish their pages using RDFa. Data formats such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of their language. Best Buy saw a 30% traffic increase after publishing their pages in RDFa using the GoodRelations vocabulary. I'm sure everyone here is aware of dbpedia.org[3] and Freebase[4] - which use RDF as a semantic representation format. dbpedia, which gets its data from Wikipedia, shows 479 million triples available - so that should give you folks some idea of the treasure trove of immediately extractable semantic data we're talking about.
Make no mistake - RDFa has very strong deployment at this point and it will continue to grow past 100,000+ sites with the upcoming release of Drupal 7.
Right -- because microdata is so new. How many of those groups actually considered using microdata? I'd guess roughly none, because in most cases, microdata either didn't exist or was barely known. If microdata is much more intuitive and simpler to use, I'd expect it to win in the long run, say five years from now. RDFa isn't so widely used that it can't be easily defeated by a clearly superior technology.
On Sat, Jan 16, 2010 at 6:37 AM, Philip Jägenstedt philip@foolip.org wrote:
Is Wikipedia using XHTML served as application/xhtml+xml?
No. We're currently using XHTML1.0 served as text/html. I expect us to switch to HTML5 served as text/html (which happens to also be well-formed XML) before we deploy support for either microdata or RDFa.
On Sat, Jan 16, 2010 at 5:16 PM, Manu Sporny msporny@digitalbazaar.com wrote:
You would do this in RDFa:
<div about="#light"> The speed of light is <span property="measure:speed" datatype="measure:meters-per-second">299792458</span> m/s. </div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
You could define different properties for different units, or allow the data to include unit info directly. Like
<span itemprop="speed">299792458 m/s</span>
and have the format itself define what "m/s" means. I don't see this as a practical issue in MediaWiki, given our use-cases (in particular, emphatically excluding markup of data that's useless to typical HTML consumers).
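As a rough sketch of what "allow the data to include unit info directly" could mean for a consumer, the following splits such a value into a number and a unit. The function name `parse_quantity` and the accepted format (a number, whitespace, then a unit token) are assumptions made for illustration; no Microdata vocabulary defines them.

```python
import re

# Hypothetical value format: an integer or decimal number, whitespace, a unit token.
QUANTITY = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s+(\S+)\s*$")


def parse_quantity(text):
    """Split an itemprop value like '299792458 m/s' into (number, unit)."""
    match = QUANTITY.match(text)
    if match is None:
        raise ValueError(f"not a quantity with a unit: {text!r}")
    number, unit = match.groups()
    return float(number), unit


print(parse_quantity("299792458 m/s"))  # (299792458.0, 'm/s')
```

A value that is not a number followed by a unit raises `ValueError`, which is one way a vocabulary-aware consumer could reject malformed data.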
An RDF reasoner would know that not only is the data not typed, but even if it were typed, the value "fast enough to hurt" is not valid.
A microdata standard would also define what type of data is valid. For instance, from the license vocabulary: "The value must be an absolute URL." "The value must be either an item with the type http://microformats.org/profile/hcard, or text."
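A constraint like "The value must be an absolute URL" is straightforward for a consumer to enforce. Here is a minimal sketch, assuming a deliberately rough definition of "absolute" (a scheme plus a network location), which is stricter than what the spec text requires; `is_absolute_url` is a hypothetical helper, not part of any specification.

```python
from urllib.parse import urlsplit


def is_absolute_url(value):
    """Rough check: the value has both a scheme and a network location.

    Assumption: this rejects scheme-only URLs such as mailto: addresses,
    so it is stricter than the spec's notion of an absolute URL.
    """
    parts = urlsplit(value.strip())
    return bool(parts.scheme) and bool(parts.netloc)


print(is_absolute_url("http://creativecommons.org/licenses/by-sa/3.0/us/"))  # True
print(is_absolute_url("/licenses/by-sa/3.0/us/"))  # False
```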
What happens when an author forgets to include itemtype?
The same as if an author forgets to include xmlns:. It's not tied to any vocabulary, so you have to either guess or ignore it. It's not ambiguous, it's just meaningless. There's no difference from RDFa here, except that RDFa encourages you to link to the profile IDs on the <html> element, which is much more likely to break under copy-paste.
RDFa is built on a concept called "follow your nose", which means that all vocabulary term URLs in RDFa, such as http://purl.org/media/audio#Recording, should be dereference-able and at the end of that URL should be a machine-readable description of the vocabulary term. Preferably, a human-readable description should also exist at that URL.
The perils of using URLs like this are well-known. Just ask the W3C how many hits it gets for DTDs every second. Microdata deliberately and wisely avoids using URLs that machines are intended to dereference. On the other hand, humans can find the info easily:
http://www.google.com/search?q=http://n.whatwg.org/work
I imagine it's meant to resolve to a human-readable spec, though, for the same discoverability as RDFa. It's probably an oversight; I've asked Hixie to clarify.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
We don't have to. See the spec:
"The item type must be a type defined in an applicable specification." http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#i...
"Applicable specification" links to
"When vendor-neutral extensions to this specification are needed, either this specification can be updated accordingly, or an extension specification can be written that overrides the requirements in this specification. When someone applying this specification to their activities decides that they will recognise the requirements of such an extension specification, it becomes an applicable specification for the purposes of conformance requirements in this specification." http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.h...
Anyone can write their own extension specification -- it becomes "applicable" as soon as anyone decides to use it.
It's like saying that programming in Python is more error prone than programming in PHP - it depends entirely on the skill of the developer, what you're doing, and many other factors that are out of the hands of language designers.
I think you'll find most MediaWiki developers strongly agree that PHP is a terrible language and Python is way better, so maybe that was a bad analogy. :)
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
Well, rather, MediaWiki has done a good job there, despite all attempts by the community. ;) Community inputs tag soup, MediaWiki converts to valid XHTML. But that's purely syntactic. You can tell from the extensive usage of tables that Wikipedians don't care about standards or theoretical purity, they just try to get things to work right. That has to be taken into account.
On Sat, Jan 16, 2010 at 5:39 PM, Platonides Platonides@gmail.com wrote:
Perhaps we shouldn't provide the full power of RDF or Microdata yet, and provide instead an extension able to handle a subset, using one or the other.
What sort of user-visible syntax would you suggest? We'd still have to use either RDFa or microdata for the actual output, so it doesn't save us much.
On Sat, Jan 16, 2010 at 7:09 PM, Happy-melon happy-melon@live.com wrote:
I know sod all about either of them except what has been posted here, but I see that they're incredibly similar, but just different enough to be incompatible; and I see that they are both horribly difficult for the lay-editor to use. By that I mean that the debate over "oh this one only requires us to put in two new attributes instead of three" misses the elephant in the room: *both* formats require us to whitelist and start filling our wikitext with the HTML tags that the most iconic piece of wikimarkup, the double brackets, has kept hidden for nine years.
I don't think microdata is harder to use than HTML generally. It's sure a lot easier to use than wikitext template syntax (look at some of those enwiki monstrosities).
and b) even the most careful implementation is going to manifest itself in article wikitext along the lines of "{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a {{occupation|football player}} for {{organisation|Puddlemere United}}". Or something like that.
No, I don't think we'd do that at all. We'd add microdata (or RDFa) to things like license templates, and maybe infobox templates. So this would all be hidden behind templates people are already using anyway. The goal is immediately useful metadata like licenses -- we want web crawlers to be able to automatically tell what licenses images are under, say. Abstract stuff like you're marking up shouldn't be provided with the HTML output, and should be input as part of infoboxes (since people do that anyway).
There seem to be two use cases for these systems. First, marking up the 'stuff' that MediaWiki serves: images, copyright links, author links, etc. That requires MW to be able to get hold of the raw data for, say, an image license; and that's begging for things like new magic words to put on the image description page, not for enabling either format directly in wikitext. The only reason to do *that* is to support editors marking up *their own stuff*, and that's where we have problems.
I don't follow. Why can't you just alter {{cc-by-2.0}} or whatever on Commons so it outputs the right markup? MediaWiki doesn't have to do anything beyond allowing the markup to begin with.
TLDR version: jumping on either bandwagon is neither necessary nor sensible, and we should avoid getting drawn into the issue.
I would agree, except that we have an immediate potential use: marking up image licenses so image crawlers know how the images are licensed. Google already hardcodes Wikipedia licenses, apparently, but we should use standards-based machine-readable markup for the benefit of all the other MediaWikis, and any Wikimedia wikis they haven't hardcoded, and Commons too if they change a template name or something and break the scraping, etc. This is why Duesentrieb added the feature. Unless we all agree it's not worth getting into this for the sake of that use-case, we do have to address the issue now.
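To make the crawler use case concrete, here is a minimal sketch of how a consumer might pull the license out of the Microdata example from earlier in this message, using only Python's standard `html.parser`. The class name `ItempropCollector` is hypothetical, and the logic is heavily simplified: it ignores nested items, `itemref`, and most of the element-specific value rules in the Microdata draft.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (itemprop, value) pairs from a single flat Microdata item."""

    # Elements whose property value is a URL attribute rather than text.
    URL_ATTRS = {"a": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.properties = []
        self._open = None  # (tag, name, text chunks) for a text-valued property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag in self.URL_ATTRS:
            self.properties.append((name, attrs.get(self.URL_ATTRS[tag])))
        else:
            self._open = (tag, name, [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            _, name, chunks = self._open
            self.properties.append((name, "".join(chunks).strip()))
            self._open = None


SNIPPET = """
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work">
<p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by
<span itemprop="author">Bob Smith</span> is licensed under a
<a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
"""

collector = ItempropCollector()
collector.feed(SNIPPET)
license_url = dict(collector.properties)["license"]
print(license_url)  # http://creativecommons.org/licenses/by-sa/3.0/us/
```

The point is only that a crawler needs no template names or site-specific scraping rules to find the license; a real consumer would follow the full extraction algorithm in the spec.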
On Sat, Jan 16, 2010 at 7:13 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Just to be clear - I'm not trying to propose that wikipedia editors should start writing wiki markup interleaved with RDFa/Microdata. Quite the opposite - I think that allowing contributors to hand author RDFa or Microdata would be a very bad idea for Wikipedia. However, it seems like what you are saying is that interleaving HTML like this is not possible anyway - which is a good thing, IMHO.
HTML can be interleaved with wikitext. This is needed because all templates are written in wikitext, for instance. Templates are just chunks of wikitext that can get included in other pages, optionally with some predefined parameters substituted with strings of yet more wikitext. So MediaWiki recursively substitutes all templates (along with other things like conditional constructs) with their wikitext output before evaluating the whole resulting mess as a single wikitext string.
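The recursive substitution can be illustrated with a toy expander. This is not MediaWiki's parser: the template store, the `license-box` template, and the regular expressions are all assumptions made for the sketch, and it handles only positional parameters with no pipes or braces inside arguments.

```python
import re

# Hypothetical template store; on a real wiki these are pages in the Template: namespace.
TEMPLATES = {
    "cc-by-2.0": "{{license-box|Attribution 2.0 Generic}}",
    "license-box": "This file is licensed under the {{{1}}} license.",
}

# An innermost call such as {{name|arg1|arg2}}, with no braces inside it.
CALL = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")
# A positional parameter placeholder such as {{{1}}} inside a template body.
PARAM = re.compile(r"\{\{\{(\d+)\}\}\}")


def expand(wikitext, depth=0):
    """Repeatedly replace template calls with their bodies until nothing changes."""
    if depth > 40:
        raise RecursionError("template loop or nesting too deep")

    def substitute(match):
        name = match.group(1).strip()
        args = match.group(2).split("|")[1:]
        body = TEMPLATES.get(name)
        if body is None:
            return match.group(0)  # unknown template: leave the call untouched

        def fill(param):
            index = int(param.group(1))
            return args[index - 1] if 0 < index <= len(args) else ""

        return PARAM.sub(fill, body)

    expanded = CALL.sub(substitute, wikitext)
    if expanded == wikitext:
        return expanded
    return expand(expanded, depth + 1)


print(expand("{{cc-by-2.0}}"))
# This file is licensed under the Attribution 2.0 Generic license.
```

Because the body of a template is itself wikitext (which may contain HTML), anything a template author can write in HTML, including extra attributes, ends up in the final string that gets parsed.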
Does anybody have a link to a previous discussion about how to get Wikipedia to output the same data that dbpedia.org is publishing?
As far as I can tell, dbpedia.org just has people manually sift through Wikipedia templates and translate them to RDF. Things like infoboxes naturally lend themselves to users inputting key-value pairs, which can easily be translated to RDF triples. I don't think we should use either microdata or RDFa for this kind of data-mining use-case -- it would be way too much markup and not useful to practically any viewers. People who want to data-mine can use a separate data stream, possibly RDF, possibly autogenerated by MediaWiki. Inline metadata is ideal only for things you want browsers, search engines, etc. to see.
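To show why infobox parameters map so naturally onto triples, here is a minimal sketch that turns the "| key = value" lines of a template call into (subject, predicate, object) tuples. It is an illustration only, not how dbpedia.org or MediaWiki does it; the function name, the subject string, and the underscore convention for predicates are all assumptions.

```python
def infobox_to_triples(subject, invocation):
    """Turn '| key = value' lines of a template call into (s, p, o) tuples."""
    triples = []
    for line in invocation.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the opening '{{Infobox ...' and closing '|}}' lines
        key, _, value = line[1:].partition("=")
        key, value = key.strip(), value.strip()
        if key and value:
            triples.append((subject, key.replace(" ", "_"), value))
    return triples


INVOCATION = """{{Infobox royalty
| name = Caesar Augustus
| place of burial = [[Mausoleum of Augustus]], Rome
|}}"""

for triple in infobox_to_triples("Augustus", INVOCATION):
    print(triple)
```

The values are left as raw wikitext here; a real exporter would also have to resolve links and pick proper vocabulary URIs for the predicates, which is the part that currently takes manual work.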