Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
This is about as long as before, but it might still be wrong. The general points I made are still accurate, anyway.
The general points that you made were riddled with technical inaccuracies and bad advice; if implemented by the MediaWiki community, they would have resulted in semantic data that would have been ambiguous at best and erroneous at worst. I don't know if you intended the tone of your e-mail in the way that I read it, but it came off as purposefully misleading based on the discussions that both you and I have had as members of the HTMLWG and WHATWG. I'll address the technical and factual errors that I believe have been made in your posts, as well as provide alternative guidance.
Just to briefly introduce myself to this community, I do standards work in a variety of online communities including the Microformats community (lead editor for hAudio, hMedia and hVideo), contract my expertise to the music industry and I am also an Invited Expert to the W3C's Semantic Web Deployment Working Group and co-chair of the upcoming RDFa Working Group and editor of the HTML5+RDFa spec. The company I founded is interested in expressing digital content online via semantic languages and builds open source software for the creation and standardization of copyright-aware, DRM-free, peer-to-peer networks.
For guidance on how to implement semantic markup in a CMS, we might want to look at the Drupal Community, who have done a superb job of integrating RDFa into their platform. They expect several hundred thousand websites to start using RDFa within the next year or two.
One lesson that we learned during the implementation of RDFa in Drupal is that it is helpful for CMS designers to pre-define vocabularies that are usable with their CMS systems if manual markup is necessary. Most Microdata and RDFa markup should also be left to the CMS code unless there is a very good reason not to do so.
If you want to allow manual markup of RDFa, MediaWiki should probably pre-define at least Dublin Core (used to describe creative works), FOAF (used to describe people and organizations), and Creative Commons (used to describe licenses). There are many RDF vocabularies to choose from and Wikipedia might consider creating a few of their own. Pre-defining vocabularies would greatly simplify the markup in case someone wanted to mark up something by hand.
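To make that concrete, here is a sketch of what pre-defining vocabularies could look like in XHTML+RDFa 1.0: MediaWiki would declare the CURIE prefixes once on the root element, so that editors and templates could use dc:, foaf: and cc: terms without any per-page setup (the exact prefix set is, of course, a design decision):

```html
<!-- A sketch, not MediaWiki's actual output: CURIE prefixes that could
     be pre-defined site-wide. dc = Dublin Core (creative works),
     dctype = DCMI types, foaf = people/organizations,
     cc = Creative Commons (licenses). -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:dctype="http://purl.org/dc/dcmitype/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"
      xmlns:cc="http://creativecommons.org/ns#">
  ...
</html>
```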
Let's revisit Aryeh's example:
Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
The above could be marked up in RDFa, with pre-defined vocabs, like so:
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
   typeof="dctype:StillImage">
  <span property="dc:title">Emery Molyneux Terrestrial Globe</span> by
  <a rel="cc:attributionUrl" href="http://example.org/bob/"
     property="cc:attributionName">Bob Smith</a> is licensed under a
  <a rel="license"
     href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
     Commons Attribution-Share Alike 3.0 United States License</a>.
</p>
This would produce the following triples (I haven't expanded the CURIEs out, in order to make it easier to read):
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> rdf:type dctype:StillImage .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> dc:title "Emery Molyneux Terrestrial Globe" .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> cc:attributionName "Bob Smith" .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg> xhv:license <http://creativecommons.org/licenses/by-sa/3.0/us/> .
So, four pieces of data, which is pretty good considering the compactness of the HTML code. The Microdata looks like this:
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
  ...
  <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
       width="640" height="480" itemprop="work">
  ...
  <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by
  <span itemprop="author">Bob Smith</span> is licensed under a
  <a itemprop="license"
     href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
     Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example. There are some things that are easier to express in Microdata and there are some things that are easier to express in RDFa. We get the following Microdata out:
type    http://n.whatwg.org/work
work    http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg
title   "Emery Molyneux Terrestrial Globe"
author  "Bob Smith"
license http://creativecommons.org/licenses/by-sa/3.0/us/
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
Concern #1:
Ambiguity is a big problem when it comes to semantics - make sure that, if this community does use Microdata markup, you fully qualify terms. It is far easier to be ambiguous in Microdata than it is in RDFa. So, instead of using itemprop="title" you should be using itemprop="http://purl.org/dc/terms/title" - which will inflate the markup required for Microdata, but is necessary when it comes to classifying this information accurately for semantic data processors (such as via SPARQL or higher-level reasoning agents).
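To illustrate, the fully-qualified variant of the earlier markup would look like this (a sketch; the URL is the Dublin Core title term mentioned above):

```html
<!-- Verbose, but unambiguous for any semantic data processor -->
<span itemprop="http://purl.org/dc/terms/title">Emery Molyneux
Terrestrial Globe</span>
```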
Concern #2:
Getting Microdata and RDFa markup correct is easier if there are templates or if the semantic markup is performed automatically by the CMS based on a pre-defined form. For example, http://en.wikipedia.org/wiki/Augustus, note the Infobox on the right. It would be much better for the RDFa markup to happen automatically via MediaWiki's template process, than for it to be marked up by hand.
Concern #3:
Intentional or not, Aryeh has painted RDFa in a negative light by not outlining a number of points related to adoption and both RDFa's and Microdata's current status in the HTML Working Group. Adopting either RDFa or Microdata in an HTML5 document type would be premature at this time because neither has progressed past the Editor's Draft stage yet. Either is subject to change as far as HTML5 is concerned, and we really don't want you to ship HTML5 features before they've had a chance to solidify a bit more.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use for deployment. Google[1] is actively indexing RDFa today, as is Yahoo[2]. Digg, Whitehouse.gov, the UK Government, The Public Library of Science and O'Reilly are high-profile sites that publish their pages using RDFa. Data formats such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of their language. Best Buy saw a 30% traffic increase after publishing their pages in RDFa using the GoodRelations vocabulary. I'm sure everyone here is aware of dbpedia.org[3] and Freebase[4], which use RDF as a semantic representation format. dbpedia, which gets its data from Wikipedia, shows 479 million triples available - so that should give you folks some idea of the treasure trove of immediately extractable semantic data we're talking about.
Make no mistake - RDFa has very strong deployment at this point and it will continue to grow past 100,000+ sites with the upcoming release of Drupal 7.
Concern #4:
While I can't fault Aryeh's enthusiasm, I am now concerned that there may be questions in this community that are going unanswered related to RDFa and Microdata. I hope this will be a deliberate process as it is easy to get semantic data markup wrong (regardless of the implementation language - Microformats, Microdata or RDFa).
I hope that those that have an interest in semantic data will discuss concerns and ask us about the lessons we've learned when implementing metadata markup. The best place to send RDFa development questions at the moment is:
http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/
We have a very friendly community that would love to answer any questions that this community may have related to semantic data markup. Please do respond to me directly or in this thread if you have lingering concerns or questions - either the RDFa community or I will do our best to answer any questions.
-- manu
[1]http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets... [2]http://developer.yahoo.net/blog/archives/2008/09/searchmonkey_support_for_rd... [3]http://en.wikipedia.org/wiki/DBpedia#Example [4]http://en.wikipedia.org/wiki/Freebase_(database)
I don't suppose that the members of this list appreciate the epic Microdata vs. RDFa battle leaking into this mailing list, but I want to address a few inaccuracies below.
Introduction: I work for Opera Software and have been active in the WHATWG and W3C HTML WG developing HTML5 for the last year and a half. I believe I have a good understanding of what browser vendors are likely and not likely to support, although I don't speak for or make any promises on behalf of Opera Software in this mail.
I have also worked on implementing the microdata DOM API in JavaScript, an ongoing experiment at http://gitorious.org/microdatajs, and I will be able to answer any technical questions about the processing of microdata. In short, I can only say that it is really quite intuitive and simple, with few surprises. It maps well to the RDF model if you want it, but doesn't force authors to think in terms of subject, predicate, object triples.
On Sat, Jan 16, 2010 at 06:32, Manu Sporny msporny@digitalbazaar.com wrote:
Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
[snip]
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example. There are some things that are easier to express in Microdata and there are some things that are easier to express in RDFa. We get the following Microdata out:
type    http://n.whatwg.org/work
work    http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg
title   "Emery Molyneux Terrestrial Globe"
author  "Bob Smith"
license http://creativecommons.org/licenses/by-sa/3.0/us/
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
No, as long as an item type is used (http://n.whatwg.org/work) there is no ambiguity. This particular item type is defined at http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#l...
The "title" property here "Gives the name of the work." There is no ambiguity.
Furthermore, for this particular vocabulary the mapping to RDF is defined, as follows:

title:   http://purl.org/dc/elements/1.1/title
author:  http://creativecommons.org/ns#attributionName
license: http://www.w3.org/1999/xhtml/vocab#license
In other words, you express the exact same information as with RDFa, but without the mental overhead of triples or mixing multiple vocabularies.
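Applying that mapping to the example, a microdata-to-RDF converter would emit essentially the same triples as the RDFa version. A sketch in Turtle (the blank-node subject is illustrative; exact subject selection depends on the converter):

```turtle
# Triples a converter could derive from the microdata example above
_:work <http://purl.org/dc/elements/1.1/title>
           "Emery Molyneux Terrestrial Globe" ;
       <http://creativecommons.org/ns#attributionName> "Bob Smith" ;
       <http://www.w3.org/1999/xhtml/vocab#license>
           <http://creativecommons.org/licenses/by-sa/3.0/us/> .
```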
Concern #2:
Getting Microdata and RDFa markup correct is easier if there are templates or if the semantic markup is performed automatically by the CMS based on a pre-defined form. For example, http://en.wikipedia.org/wiki/Augustus, note the Infobox on the right. It would be much better for the RDFa markup to happen automatically via MediaWiki's template process, than for it to be marked up by hand.
Certainly, but if wiki editors are *able* to do it by hand, then IMHO microdata is much less error-prone.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe
Is Wikipedia using XHTML served as application/xhtml+xml? It seems that RDFa in "XHTML" as deployed only works because consumers pretend that the data is XHTML even though it is served as text/html and treated as such by browsers. I would assume that most pages using RDFa today are neither valid XHTML nor served with the XHTML MIME type. Any attempts to use browser DOM APIs to access the data will have surprising/confusing results, as HTML doesn't have namespaces but RDFa uses namespace syntax.
Concern #4:
While I can't fault Aryeh's enthusiasm, I am now concerned that there may be questions in this community that are going unanswered related to RDFa and Microdata. I hope this will be a deliberate process as it is easy to get semantic data markup wrong (regardless of the implementation language - Microformats, Microdata or RDFa).
Agreed.
The microdata spec for the curious: http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html
Finally, I will note that it is very likely that the microdata DOM APIs will get implemented in browsers, making the semantic data available to scrapers, native browser interfaces and browser extensions such as user JavaScript. As an example, you might see an icon in the address bar for saving events to a calendar, or the license information of an image displayed in the native properties dialog. I stress again that I don't make any promises on behalf of Opera or any other browser vendor; these are just my predictions.
In other goodies, microdata already has a defined mapping to JSON, so dumping all embedded data as JSON via a web interface would be quite trivial, using the same format that you will get from browsers once they have implemented some of the DOM APIs.
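As a sketch of that JSON serialization (illustrative, not normative; check the spec for the exact shape), the example item would come out roughly as:

```json
{
  "items": [
    {
      "type": "http://n.whatwg.org/work",
      "properties": {
        "work": ["http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg"],
        "title": ["Emery Molyneux Terrestrial Globe"],
        "author": ["Bob Smith"],
        "license": ["http://creativecommons.org/licenses/by-sa/3.0/us/"]
      }
    }
  ]
}
```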
Philip Jägenstedt wrote:
I don't suppose that the members of this list appreciate the epic Microdata vs. RDFa battle leaking into this mailing list
I wouldn't use such terms to frame the debate. The Microformats, Microdata and RDFa communities are not "battling" or working against each other - they're having a very necessary, spirited debate. Clearly, both communities are influencing the design of the other and clearly we need to have these discussions in order to make sure that we're creating the best possible technology for the future of the Web.
More importantly, the reason that all of us are working on this technology is because we care about how it is used to better humanity. At least, I hope that's why people are working on this stuff :). Certainly, we all hold Wikipedia in high regard and want what's best for this community as well.
It's not /unfortunate/ that we're having the discussion here - it was inevitable.
I'm delighted by the fact that we're even having this debate. It took ages to convince the WHATWG that this was a problem that needed to be addressed[1] just 18 months ago.
So, we can either grit our teeth and begrudgingly go through the motions, or we can welcome the debate to come.
I choose to do the latter because I know that all of us will learn something from it and better understand the requirements for Wikimedia implementations. What we learn here will further influence guidance given to future communities, just as integrating RDFa with Drupal has influenced the advice that we may give to this community.
[ed: Microdata] is really quite intuitive and simple, with few surprises.
I agree on the first point - Microdata is pretty intuitive and simple, with few surprises. Although, I'd say the same for RDFa as well. I think we tend to forget, though, that Web semantics require a bit of effort to learn and the audience that is using the technology should be taken into account when deciding how to expose an authoring environment for the community.
I don't think that the best approach for Wikipedia is to allow direct Microdata or RDFa markup. There are already many templates in use at Wikipedia via Infobox - those templates could be leveraged to automatically generate RDFa in the same way that dbpedia.org uses those templates to generate RDF. The risk this community runs by allowing arbitrary semantic data markup is that contributors make mistakes causing half of the semantic data to be corrupted - making the rest of the data useless.
Neither Microdata nor RDFa come with few surprises for the beginner. Like all new web technologies, there is a learning curve for both of them and it's pretty similar since Microdata's design was influenced by RDFa and Microformats. More about the surprises with each, below.
[ed: Microdata] maps well to the RDF model if you want it, but doesn't force authors to think in terms of subject, predicate, object triples.
Well, Microdata /almost/ maps to the RDF model. Microdata doesn't support RDF literal typing, which is basically a fancy way of saying that you can't verify that weights, volumes, speeds, the full range of dates in different calendars, encodings such as chemical compositions, and other typed information are expressed cleanly by Wikipedia contributors.
So, if you wanted to say something like this:
The speed of light is 299792458 m/s.
You would do this in RDFa:
<div about="#light">
  The speed of light is
  <span property="measure:speed"
        datatype="measure:meters-per-second">299792458</span> m/s.
</div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
Some of you may be asking yourselves "Why is that so important?". The primary concern has to do with data validation. Good RDF vocabularies are built to be able to validate their data and this is important for large sites like Wikipedia to ensure that the data that they're exposing is valid. Since measure:speed's range is measure:meters-per-second, and meters-per-second is presumably a sub-class of xsd:decimal, then a data validator would know that it's expecting some sort of number. So, if a Wikipedia author enters some markup that generates this data:
<#baseball> measure:speed "fast enough to hurt" .
An RDF reasoner would know that not only is the data not typed, but even if it were typed, the value "fast enough to hurt" is not valid. I would expect that this most basic level of data validation would be important to Wikipedia as you want to make sure that contributors are being careful with their markup.
The above is how you would do it in RDFa. Philip, I haven't seen any work related to this in Microdata - have there been any recent developments with regard to data validation in Microdata?
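On the RDF side, a validator could flag untyped or mistyped values with a simple SPARQL query. A sketch (the measure: prefix URL is hypothetical, standing in for whichever measurement vocabulary is chosen):

```sparql
# Find speed values that are not typed as meters-per-second
PREFIX measure: <http://example.org/measure#>
SELECT ?s ?v
WHERE {
  ?s measure:speed ?v .
  FILTER ( datatype(?v) != measure:meters-per-second )
}
```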
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
No, as long as an item type is used (http://n.whatwg.org/work) there is no ambiguity. This particular item type is defined at http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#l...
The "title" property here "Gives the name of the work." There is no ambiguity.
This is new! I'm glad this issue was addressed in Microdata as it was one of my criticisms of it when I last read the Microdata spec about six months ago. Looks like that section of the spec was last changed on October 23rd 2009? Do you know when this was put in there, Philip?
What happens when an author forgets to include itemtype? So, if somebody does this:
<div itemscope>
  <span itemprop="title">Emery Molyneux Terrestrial Globe</span>
</div>
There's nothing to ground the "title" property. The way I'm reading the spec, it becomes ambiguous at that point, right?
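For contrast, the ambiguity goes away as soon as the item is typed; a corrected sketch of the same fragment:

```html
<div itemscope itemtype="http://n.whatwg.org/work">
  <span itemprop="title">Emery Molyneux Terrestrial Globe</span>
</div>
```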
RDFa is very careful to never let something like this happen... as this data ambiguity results in questionable data that you wouldn't want to pass to a reasoning agent.
Furthermore, for this particular vocabulary the mapping to RDF is defined, as follows:
title:   http://purl.org/dc/elements/1.1/title
author:  http://creativecommons.org/ns#attributionName
license: http://www.w3.org/1999/xhtml/vocab#license
In other words, you express the exact same information as with RDFa, but without the mental overhead of triples or mixing multiple vocabularies.
... and with the added danger of expressing ambiguous data. This is not the real danger, though. While data ambiguity is really bad when it comes to data stores, centralized vocabulary management is even worse.
RDFa is built on a concept called "follow your nose", which means that all vocabulary term URLs in RDFa, such as http://purl.org/media/audio#Recording, should be dereferenceable, and at the end of that URL should be a machine-readable description of the vocabulary term. Preferably, a human-readable description should also exist at that URL.
Dereference http://n.whatwg.org/work and you get a 404 error. Even worse, the Microdata work vocabulary is hard-coded in the HTML5 specification. If you wanted to extend the vocabulary, you would have to convince the sole editor of that specification, who has a track record of being both very easy and very difficult to work with (based on whether or not he agrees with you), that your vocabulary term warrants addition.
There are currently 3 Microdata vocabularies in the spec[2].
To contrast, there are over 250 active RDF vocabularies[3].
That is the true power of decentralized vocabulary development, which is a corner-stone of RDFa. The RDFa community understands that Wikipedia should be in charge of choosing and extending vocabularies since this community has the appropriate domain experts. You are the experts, we are not - and it's important to recognize that in the design of any semantic data expression language.
If Wikipedia agrees that embedding semantics in their pages is of worth to humanity (and I certainly think it is of great worth), then there will come a time that this community will want to develop their own vocabulary. RDFa allows that vocabulary to be developed independently of any standards body and allows this community to have full control of it.
Sure, you could make the argument that Microdata allows RDF to be expressed (as long as you use the complete vocabulary URL), but at that point the Microdata markup is far more cumbersome than the RDFa markup. And if the goal is to express RDF, that is exactly what RDFa was designed to accomplish.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
Certainly, but if wiki editors are *able* to do it by hand, then IMHO microdata is much less error-prone.
IMHO, there are ways to shoot yourself in the foot with both Microdata and RDFa - as I've outlined above. I suppose that you could use both and pick which foot you're going to shoot with which technology :), but my suggestion is that nobody should be making such generalized statements - that one is more error-prone than the other.
It's like saying that programming in Python is more error prone than programming in PHP - it depends entirely on the skill of the developer, what you're doing, and many other factors that are out of the hands of language designers.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe
Is Wikipedia using XHTML served as application/xhtml+xml? It seems that RDFa in "XHTML" as deployed only works because consumers pretend that the data is XHTML even though it is served as text/html and treated as such by browsers. I would assume that most pages using RDFa today are neither valid XHTML nor served with the XHTML MIME type. Any attempts to use browser DOM APIs to access the data will have surprising/confusing results, as HTML doesn't have namespaces but RDFa uses namespace syntax.
Frankly, this is something that nobody that uses this technology cares about because all they are ever going to see are key-value pairs (Microdata) or triples (RDFa).
This is something that only concerns browser manufacturers and RDFa parser writers. That's why there is a Microdata API, and why there is going to be an RDFa API. There also exist many RDFa parser implementations that abstract this low-level stuff away.
Both Microdata and RDFa are being designed to operate in "dirty" environments with invalid markup and will work regardless of the MIME type, file extension, markup botching and namespace support across websites and web browsers.
There are a number of RDFa Javascript implementations that work just[4] fine[5] on badly authored/served XHTML documents.
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_Nati... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shune...
The migration to XHTML+RDFa would only require the DOCTYPE to change... which shouldn't be any more difficult than transitioning to HTML5 (or HTML5+RDFa) in the future.
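Concretely, that change would amount to swapping the DOCTYPE, along the lines of the following sketch (the first DOCTYPE assumes MediaWiki's current XHTML 1.0 Transitional output):

```html
<!-- Before: XHTML 1.0 Transitional -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!-- After: XHTML+RDFa 1.0 -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkupValidation/DTD/xhtml-rdfa-1.dtd">
```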
Finally, I will note that it is very likely that the microdata DOM APIs will get implemented in browsers, making the semantic data available to scrapers, native browser interfaces and browser extensions such as user JavaScript. As an example, you might see an icon in the address bar for saving events to a calendar, or the license information of an image displayed in the native properties dialog. I stress again that I don't make any promises on behalf of Opera or any other browser vendor; these are just my predictions.
Again, this is exciting news and while I don't think Microdata is the proper solution for the Web, for the same reasons that are outlined above and many more, I'm delighted to hear that Opera is taking in-browser semantic data expression very seriously. How far we have come in just 18 months! :)
-- manu
[1]http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-August/015971.html [2]http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#m... [3]http://prefix.cc/popular/all [4]http://code.google.com/p/rdfquery/ [5]http://code.google.com/p/ubiquity-rdfa/
Philip wrote:
Certainly, but if wiki editors are *able* to do it by hand, then IMHO microdata is much less error-prone.
Manu Sporny wrote:
I don't think that the best approach for Wikipedia is to allow direct Microdata or RDFa markup. There are already many templates in use at Wikipedia via Infobox - those templates could be leveraged to automatically generate RDFa in the same way that dbpedia.org uses those templates to generate RDF. The risk this community runs by allowing arbitrary semantic data markup is that contributors make mistakes causing half of the semantic data to be corrupted - making the rest of the data useless.
Both of you seem to think that wikipedia editors would start placing RDF/Microdata interleaved with wiki markup. I don't think that could ever happen. The "direct markup" would be inserted into infoboxes (which are themselves wikitext, although they can get quite complex).
Perhaps we shouldn't provide the full power of RDF or Microdata yet, and provide instead an extension able to handle a subset, using one or the other.
(long text about if wikipedia XHTML is served as application/xml+xhtml and why it doesn't matter)
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_Nati... http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shune...
The migration to XHTML+RDFa would only require the DOCTYPE to change... which shouldn't be any more difficult than transitioning to HTML5 (or HTML5+RDFa) in the future.
MediaWiki is expected to produce good XHTML (the output is passed through Tidy), but it nonetheless sometimes fails. And there are IE users, too. There is also a switch in MediaWiki for using HTML5 instead of XHTML.
2010/1/16 Platonides Platonides@gmail.com:
Both of you seem to think that wikipedia editors would start placing RDF/Microdata interleaved with wiki markup. I don't think that could ever happen. The "direct markup" would be inserted into infoboxes (which are themselves wikitext, although they can get quite complex).
Something deep inside the plumbing of a template would be the place for this.
- d.
Platonides wrote:
Both of you seem to think that wikipedia editors would start placing RDF/Microdata interleaved with wiki markup. I don't think that could ever happen. The "direct markup" would be inserted into infoboxes (which are themselves wikitext, although they can get quite complex).
Just to be clear - I'm not trying to propose that wikipedia editors should start writing wiki markup interleaved with RDFa/Microdata. Quite the opposite - I think that allowing contributors to hand author RDFa or Microdata would be a very bad idea for Wikipedia. However, it seems like what you are saying is that interleaving HTML like this is not possible anyway - which is a good thing, IMHO.
Perhaps we shouldn't provide the full power of RDF or Microdata yet, and provide instead an extension able to handle a subset, using one or the other.
XHTML1+RDFa is certainly ready for prime-time, so it would be up to this community to decide if it should go that route and put it into the core distribution or have it implemented as an extension.
I think our preference would be that it is implemented as an extension first and in such a way as to make it very easy to integrate it into MediaWiki core once all of the bugs are worked out in the extension.
Does anybody have a link to a previous discussion about how to get Wikipedia to output the same data that dbpedia.org is publishing?
David Gerard wrote:
Something deep inside the plumbing of a template would be the place for this.
I agree.
-- manu
I could see the flames rising at the start of this thread, so thank you both for steering away from them.
Essentially we have a format war here, in which one or other format will win and the other will go extinct. It might be being fueled by altruism rather than capitalism, and that's brilliant, but VHS and Betamax are watching from the wings. I know sod all about either of them except what has been posted here, but I can see that they're incredibly similar, yet just different enough to be incompatible; and I can see that they are both horribly difficult for the lay editor to use. By that I mean that the discussion of "oh, this one only requires us to put in two new attributes instead of three" misses the elephant in the room: *both* formats require us to whitelist, and start filling our wikitext with, the HTML tags that the most iconic piece of wiki markup, the double brackets, has kept hidden for nine years. The reason we brought in that now-ubiquitous syntax hasn't changed: the damn thing was too difficult for the layman to understand and use.
We do, without a doubt, need to implement this metadata-capture in MediaWiki somehow, but we need to do it not only in a way that the majority of people can use and understand, but in a way which doesn't make wikitext even more complicated for everyone. If either syntax were enabled, yes, it would end up at the bottom of a template stack, but a) that's not going to do anything to ensure that the tags aren't being misused elsewhere, and b) even the most careful implementation is going to manifest itself in article wikitext along the lines of "{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a {{occupation|football player}} for {{organisation|Puddlemere United}}". Or something like that. If we encourage editors to go the whole hog on this, we might as well install SMW.
There seem to be two usecases for these systems. First, marking up the 'stuff' that MediaWiki serves: images, copyright links, author links, etc. That requires MW to be able to get hold of the raw data for, for instance, an image license; and that's begging for things like new magic words to put on the image description page, not for enabling either format directly in wikitext. The only reason to do *that*, is to support editors marking up *their own stuff*, and that's where we have problems.
I think that it would be foolish beyond belief to encourage editors to divert their volunteer time to implementing a system that could turn out to be totally anachronistic within two years; and while I think it's a laudable long-term goal I think it would thus be very silly to let editors insert *either* format directly into wikitext at this point, or for a good year to come. By far the top priority should be implementing structures by which MediaWiki can *collect* semantic data. If we implement a {{COPYRIGHT:...}} parser function, or a metadata form, or (as I've been musing over for a while) a Category-esque system that wasn't based on wikitext and so which could have a fine-grained permissions interface; we create a feature that is useful whatever happens in the metadata world. We can implement RDFa with that data, microdata, both, neither or something else entirely. We could certainly expose it through our own API. Whatever happens, editor work is not wasted.
TLDR version: jumping on either bandwagon is neither necessary nor sensible, and we should avoid getting drawn into the issue. Implementing either of the proposed methods in raw wikitext actively defeats one of the purposes of MediaWiki: to make it as easy as possible for anyone to edit stuff. It would need to be carefully thought through, and there's no point putting that effort in until we know which format has come out on top. Adding metadata to MW's own stuff is much easier, but its groundwork should be format-independent.
in this world of economic crisis, £0.02 seems to go quite a long way :-D
--HM
Trying my best to limit length of reply.
On Sat, Jan 16, 2010 at 23:16, Manu Sporny msporny@digitalbazaar.com wrote:
Philip Jägenstedt wrote:
[ed: Microdata] maps well to the RDF model if you want it, but doesn't force authors to think in terms of subject, predicate, object triples.
Well, Microdata /almost/ maps to the RDF model. Microdata doesn't support RDF literal typing, which is basically a fancy way of saying that you can't verify that weights, volumes, speeds, the full range of dates in different calendars, encodings such as chemical compositions, and various other typed information are expressed cleanly by Wikipedia contributors.
So, if you wanted to say something like this:
The speed of light is 299792458 m/s.
You would do this in RDFa:
<div about="#light"> The speed of light is <span property="measure:speed" datatype="measure:meters-per-second">299792458</span> m/s. </div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
The datatype is part of the vocabulary: if you want to validate your data, you validate it against the vocabulary, not against what the author claims. For example, you'll see that the vCard vocabulary defines its own datatypes: http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#v...
Allowing mixed types (like m/s and km/h) seems risky, but it is correct that this is one of the things that exist in the RDF model and can't be expressed directly using microdata.
The above is how you would do it in RDFa. Philip, I haven't seen any work related to this in Microdata - have there been any recent developments with regard to data validation in Microdata?
There is nothing like automatic validation; your software has to understand a certain vocabulary to be able to say whether the data conforms to the constraints of that particular vocabulary. (I don't know if this is any different from the RDF model, or if RDF software is able to "automatically" learn how to validate measure:meters-per-second from just seeing the string "measure:meters-per-second".)
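To make Philip's point concrete, here is a minimal Python sketch of vocabulary-driven validation: the constraints live in the consumer's knowledge of the vocabulary, not in the markup. The "measure:" vocabulary and its checkers below are invented purely for illustration.

```python
# Hypothetical sketch: validation is driven by vocabulary knowledge, not by
# markup-level datatype annotations. The consumer must already understand
# the (invented) vocabulary to check values at all.

VOCABULARY = {
    # property name -> validator returning True if the value conforms
    "measure:speed": lambda v: v.replace(".", "", 1).isdigit(),
    "license": lambda v: v.startswith("http://") or v.startswith("https://"),
}

def validate(prop, value):
    """Return True/False if the vocabulary defines constraints for prop,
    or None if the property is unknown (nothing can be checked)."""
    checker = VOCABULARY.get(prop)
    if checker is None:
        return None  # unknown property: no automatic validation possible
    return checker(value)
```

An unknown property yields None rather than False: the consumer simply has nothing to say about data it doesn't understand, which matches Philip's description for both microdata and plain RDF consumers.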
So, we get more-or-less the same number of data items out, but there is a problem. What does "title" mean in the semantic sense? Does it mean "job title" or does it mean "work title"? The term "title" in this case is ambiguous.
No, as long as an item type is used (http://n.whatwg.org/work) there is no ambiguity. This particular item type is defined at http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#l...
Title here "Gives the name of the work." without ambiguity.
This is new! I'm glad this issue was addressed in Microdata as it was one of my criticisms of it when I last read the Microdata spec about six months ago. Looks like that section of the spec was last changed on October 23rd 2009? Do you know when this was put in there, Philip?
Originally microdata used item="http://n.whatwg.org/work", but even then there was no ambiguity about what a particular property meant.
What happens when an author forgets to include itemtype? So, if somebody does this:
<div itemscope> <span itemprop="title">Emery Molyneux Terrestrial Globe</span> </div>
There's nothing to ground the "title" property. The way I'm reading the spec, it becomes ambiguous at that point, right?
Like Aryeh said it's not ambiguous, it's meaningless. Microdata allows typeless items for site-private use (much like data-*), but such data *should not* be used by external parties and is in fact ignored by the RDF extraction algorithm.
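As a rough illustration of that extraction rule, with items modelled as plain dictionaries (a deliberate simplification of the real microdata model):

```python
# Simplified model: a typeless item is site-private, so an RDF-style
# extractor skips it instead of emitting ungrounded (meaningless) data.

def extract_for_rdf(items):
    """Keep only items whose properties can be grounded via an item type."""
    return [item for item in items if item.get("itemtype")]

items = [
    {"itemtype": "http://n.whatwg.org/work",
     "props": {"title": "Emery Molyneux Terrestrial Globe"}},
    # No itemtype: usable only site-internally, ignored by extraction.
    {"props": {"title": "site-private note"}},
]
```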
... and with the added danger of expressing ambiguous data. This is not the real danger, though. While data ambiguity is really bad when it comes to data stores, centralized vocabulary management is even worse.
Anyone can make up a vocabulary, just point to it in itemtype. The WHATWG maintains a few core vocabularies, but I expect that new vocabularies will be developed independently by communities like microformats.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
No process, just do it :)
Finally I will note that it is very likely that the microdata DOM APIs will get implemented in browsers, making the semantic data available to scrapers, to native browser interfaces, and to browser extensions such as user JavaScript. As an example, you might see an icon in the address bar for saving events to a calendar, or the license information of an image displayed in the native properties dialog. I stress again that I don't make any promises on behalf of Opera or any other browser vendor; these are just my predictions.
Again, this is exciting news and while I don't think Microdata is the proper solution for the Web, for the same reasons that are outlined above and many more, I'm delighted to hear that Opera is taking in-browser semantic data expression very seriously. How far we have come in just 18 months! :)
I will stress again that I don't speak for Opera in these matters, but I do think that microdata in many ways bridges the gap between the "browsable web" and the "semantic web" (actually, there is only one web). Browsers already do add some UI features based on the data in documents (apart from rendering), e.g. exposing RSS feeds in the address bar or navigating to the next page based on rel="next". Microdata isn't really new in that regard, it just adds some new data for browsers to expose.
2010/1/16 Manu Sporny msporny@digitalbazaar.com:
I don't know if you intended the tone of your e-mail in the way that I read it, but it came off as purposefully misleading based on the discussions that both you and I have had as members of the HTMLWG and WHATWG.
[...]
We have a very friendly community
- d.
On Sat, Jan 16, 2010 at 12:32 AM, Manu Sporny msporny@digitalbazaar.com wrote:
I don't know if you intended the tone of your e-mail in the way that I read it, but it came off as purposefully misleading based on the discussions that both you and I have had as members of the HTMLWG and WHATWG.
I do not claim to be an expert on RDFa, Microdata, or any similar technology. I'd prefer not to have to make a decision here at all, and I've said so. However, it looks like we (MediaWiki) have good reason to use something or other. For the reasons I gave, I think we should choose whatever we believe is more likely to succeed, and failing that, whatever we think is better (e.g., on grounds of aesthetics or intuitiveness). The example markup I gave might not be ideal or accurate, but it serves to give a general idea of what the markup looks like in each case, at least. Thank you for your better RDFa examples -- although it's telling that I was able to get Microdata right on the first try, but apparently it took an RDFa expert to figure out the correct RDFa.
However, as a Wikimedian, I'd like to point you to one of our core guiding principles:
http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith
One lesson that we learned during implementation of RDFa in Drupal is that it is helpful for CMS designers to pre-define vocabularies that are usable with their CMS systems if manual markup is necessary. Most markup of both Microdata and RDFa should also be left to the CMS code unless there is a very good reason to not do so.
The major use case for us is image licensing on Commons. Currently the license templates are generated "by hand", in the sense that they're not hardcoded in the software; in practice they're maintained by technically advanced community members, so ordinary users don't see the markup. To use my example image, look at this page:
http://commons.wikimedia.org/wiki/File:EmeryMolyneux-terrestrialglobe-1592-2...
You can see the wikitext source of the page by hitting "view source" (or "edit" if it's unprotected by the time you read this) at the top. The license info is generated by:
{{cc-by-2.0}}
This expands to:
<table cellspacing="8" cellpadding="0" style="width:100%; clear:both; text-align:center; margin:0.5em auto; background-color:#f9f9f9; border:2px solid #e0e0e0; direction: ltr;" class="layouttemplate">
  <tr>
    <td style="width:90px;" rowspan="3">
      <img alt="w:en:Creative Commons" src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/CC_some_rights_reserved.svg/90px-CC_some_rights_reserved.svg.png" width="90" height="36" /><br />
      <img alt="attribution" src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Cc-by_new_white.svg/24px-Cc-by_new_white.svg.png" width="24" height="24" />
    </td>
    <td>This file is licensed under the <a href="http://en.wikipedia.org/wiki/en:Creative_Commons" class="extiw" title="w:en:Creative Commons">Creative Commons</a> <a href="http://creativecommons.org/licenses/by/2.0/deed.en" class="external text" rel="nofollow">Attribution 2.0 Generic</a> license.</td>
    <td style="width:90px;" rowspan="3"></td>
  </tr>
  <tr style="text-align:center;">
    <td></td>
  </tr>
  <tr style="text-align:left;">
    <td>
      <dl>
        <dd>You are free:
          <ul>
            <li><b>to share</b> – to copy, distribute and transmit the work</li>
            <li><b>to remix</b> – to adapt the work</li>
          </ul>
        </dd>
        <dd>Under the following conditions:
          <ul>
            <li><b>attribution</b> – You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).</li>
          </ul>
        </dd>
      </dl>
    </td>
  </tr>
</table>
(Not cutting-edge markup, but oh well.) This is generated by the contents of http://commons.wikimedia.org/wiki/Template:Cc-by-2.0, which was created by the Commons community. Pretty much all boilerplate on Wikimedia projects is created by such templates. So when we enable Microdata and/or RDFa in MediaWiki wikitext, I'd expect it to be used almost exclusively in templates, with few users actually being directly exposed to it. Since the content of MediaWiki pages has no structure other than wikitext, basically we have to allow this in wikitext to make it useful to mark up content.
I'll emphasize from the start that I do *not* think either RDFa or microdata is suitable for dbpedia.org-style content. There's no reason we should put that in the HTML output, where it will take up tons of space and not be useful to HTML consumers (e.g., browsers and search engines). That sort of data should be made available in a separate stream for consumers who want it, in a dedicated format like RDF. That way HTML consumers aren't forced to download loads of useless metadata, and metadata consumers aren't forced to download loads of useless (and expensive-to-generate) HTML. RDFa/Microdata should *only* be used for metadata that's useful to HTML consumers of some kind.
If you want to allow manual markup of RDFa, MediaWiki should probably pre-define at least Dublin Core (used to describe creative works), FOAF (used to describe people and organizations), and Creative Commons (used to describe licenses).
I expect that we'd allow contributors to use whatever vocabularies they'd like. It's a wiki, after all. :)
The above could be marked up in RDFa, with pre-defined vocabs, like so:
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage"> <span property="dc:title">Emery Molyneux Terrestrial Globe</span> by <a rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
. . .
So, four pieces of data, which is pretty good considering the compactness of the HTML code. The Microdata looks like this:
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work"> ... <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div>
The compactness of the markup between Microdata and RDFa is more or less the same in this particular example.
You're comparing apples to oranges here: you included the div and img for Microdata but not RDFa. If you include that for RDFa, and also count the xmlns:, it becomes (correct me if I'm wrong)
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage"><span xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">Emery Molyneux Terrestrial Globe</span> by <a xmlns:cc="http://creativecommons.org/ns#" rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> ]]
You do have to count the xmlns: somewhere. Even if you put them on the <html>, they still count at least once, and in this case they're only used once on the page, so they deserve to count in full. This is 685 characters. On the other hand, Microdata:
[[ <div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480" itemprop="work"> ... <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div> ]]
525 characters. Compare to the original with no extra semantics:
[[ <div id="bodyContent"> ... <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" width="640" height="480"> ... <p>Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a <a href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p> </div> ]]
380 characters. So Microdata adds 145 bytes, while RDFa adds 305: 2.1 times as much extra markup. To be fair, you included an extra link to http://example.org/bob/ which wasn't in the original example, but RDFa is still about twice as many bytes.
It's not just bytes, though. It's also complexity. The Microdata is *obvious*. I've never used Microdata before in my life, or RDFa, but somehow I got the Microdata right on the first try, while making several errors in the RDFa. It's not at all obvious what those xmlns: things do, or what those cryptic prefixes mean. Microdata is simpler to understand at first glance for people from an HTML background. Since you've been working with RDF for years, the magnitude of the difference is probably not apparent to you.
Getting Microdata and RDFa markup correct is easier if there are templates or if the semantic markup is performed automatically by the CMS based on a pre-defined form. For example, see http://en.wikipedia.org/wiki/Augustus and note the infobox on the right. It would be much better for the RDFa markup to happen automatically via MediaWiki's template process than for it to be marked up by hand.
As I noted, the templates are made by hand, by each community. The software just gives the ability to include one page in another with simple substitutions made. The infobox on the Augustus article is http://en.wikipedia.org/wiki/Template:Infobox_royalty, invoked like so:
{{Infobox royalty
| name = Caesar Augustus
| title = [[Roman Emperor|Emperor]] of the [[Roman Empire]]
. . . snip 18 lines . . .
| place of death = [[Nola]], [[Italia (Roman Empire)|Italia]], [[Roman Empire]]
| place of burial = [[Mausoleum of Augustus]], Rome
|}}
The template authors would be the ones to add semantics here, not the software developers. There are a couple orders of magnitude more wiki editors than software developers, so it just wouldn't be practical for the developers to be the ones to assign semantic markup to each and every template. Moreover, as you can tell from the HTML output of the templates, template editors tend to be of the "copy-paste stuff until it works" school of HTML authorship. So you cannot argue here that RDFa is just as good if we abstract away the actual markup. We aren't in a position to do that -- users with little to no knowledge of RDFa or microdata will be editing the raw markup, and that has to be taken into account.
Intentional or not, Aryeh has painted RDFa in a negative light by not outlining a number of points related to adoption and to the current status of both RDFa and Microdata in the HTML Working Group. Adopting either RDFa or Microdata in an HTML5 document type would be premature at this time because neither has progressed past the Editor's Draft stage yet. Either is subject to change as far as HTML5 is concerned, and we really don't want you to ship HTML5 features before they've had a chance to solidify a bit more.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use it for deployment.
Microdata is also safe to use for deployment. Like other web technologies maintained by the WHATWG, it will not change once it's widely adopted, and Wikipedia adoption would probably count as wide adoption by itself. Note that microdata, like all of HTML5, is at Last Call at the WHATWG, independent of its status as Working Draft in the W3C.
I've asked Hixie how stable Microdata is. Since he's the sole person who decides on changes to HTML5 at the WHATWG, as you know, his answer should be authoritative.
Google[1] is actively indexing RDFa today, as is Yahoo[2]. Digg, Whitehouse.gov, the UK Government, The Public Library of Science, and O'Reilly are high-profile sites that publish their pages using RDFa. Data formats such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of their language. Best Buy saw a 30% traffic increase after publishing their pages in RDFa using the GoodRelations vocabulary. I'm sure everyone here is aware of dbpedia.org[3] and Freebase[4], which use RDF as a semantic representation format. dbpedia, which gets its data from Wikipedia, shows 479 million triples available - so that should give you folks some idea of the treasure trove of immediately extractable semantic data we're talking about.
Make no mistake - RDFa has very strong deployment at this point, and it will continue to grow past 100,000 sites with the upcoming release of Drupal 7.
Right -- because microdata is so new. How many of those groups actually considered using microdata? I'd guess roughly none, because in most cases, microdata either didn't exist or was barely known. If microdata is much more intuitive and simpler to use, I'd expect it to win in the long run, say five years from now. RDFa isn't so widely used that it can't be easily defeated by a clearly superior technology.
On Sat, Jan 16, 2010 at 6:37 AM, Philip Jägenstedt philip@foolip.org wrote:
Is Wikipedia using XHTML served as application/xml+xhtml?
No. We're currently using XHTML1.0 served as text/html. I expect us to switch to HTML5 served as text/html (which happens to also be well-formed XML) before we deploy support for either microdata or RDFa.
On Sat, Jan 16, 2010 at 5:16 PM, Manu Sporny msporny@digitalbazaar.com wrote:
You would do this in RDFa:
<div about="#light"> The speed of light is <span property="measure:speed" datatype="measure:meters-per-second">299792458</span> m/s. </div>
which would generate the following triple:
<#light> measure:speed "299792458"^^measure:meters-per-second .
AFAIK, there is no way to do the equivalent in Microdata, is there Philip?
You could define different properties for different units, or allow the data to include unit info directly. Like
<span itemprop="speed">299792458 m/s</span>
and have the format itself define what "m/s" means. I don't see this as a practical issue in MediaWiki, given our use-cases (in particular, emphatically excluding markup of data that's useless to typical HTML consumers).
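To sketch what "the format itself defines what 'm/s' means" could look like on the consumer side (the unit table here is invented, not part of any real vocabulary):

```python
# Hypothetical consumer of a speed property whose unit is embedded in the
# value, as in <span itemprop="speed">299792458 m/s</span>. A real
# vocabulary would enumerate the allowed unit strings; these are made up.

UNIT_FACTORS = {"m/s": 1.0, "km/h": 1 / 3.6}  # normalise to metres per second

def parse_speed(text):
    """Parse '299792458 m/s' into a float in m/s; raise on unknown units."""
    value, _, unit = text.rpartition(" ")
    if unit not in UNIT_FACTORS:
        raise ValueError("unknown unit: %r" % unit)
    return float(value) * UNIT_FACTORS[unit]
```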
An RDF reasoner would know that not only is the data not typed, but even if it were typed, the value "fast enough to hurt" is not valid.
A microdata standard would also define what type of data is valid. For instance, from the license vocabulary: "The value must be an absolute URL." "The value must be either an item with the type http://microformats.org/profile/hcard, or text."
What happens when an author forgets to include itemtype?
The same as if an author forgets to include xmlns:. It's not tied to any vocabulary, you have to either guess or ignore it. It's not ambiguous, it's just meaningless. There's no difference to RDFa here, except that RDFa encourages you to link to the profile IDs on the <html> element, which is much more likely to break under copy-paste.
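For readers unfamiliar with how those xmlns: declarations get consumed, here is a simplified sketch of CURIE expansion (not a real RDFa processor): if the prefix isn't in scope, the property can't be grounded, much like a typeless microdata item.

```python
# Simplified CURIE expansion: 'dc:title' is only meaningful if the 'dc'
# prefix was bound by an in-scope xmlns: declaration. Copy-pasting markup
# without that declaration leaves the property ungrounded.

def resolve_curie(curie, prefixes):
    """Expand 'dc:title' to a full URL, or return None if the prefix
    is unbound (the data is then unusable to a consumer)."""
    prefix, _, local = curie.partition(":")
    base = prefixes.get(prefix)
    if base is None:
        return None
    return base + local

prefixes = {"dc": "http://purl.org/dc/elements/1.1/"}
```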
RDFa is built on a concept called "follow your nose", which means that all vocabulary term URLs in RDFa, such as http://purl.org/media/audio#Recording, should be dereference-able and at the end of that URL should be a machine-readable description of the vocabulary term. Preferably, a human-readable description should also exist at that URL.
The perils of using URLs like this are well-known. Just ask the W3C how many hits it gets for DTDs every second. Microdata deliberately and wisely avoids using URLs that machines are intended to dereference. On the other hand, humans can find the info easily:
http://www.google.com/search?q=http://n.whatwg.org/work
I imagine it's meant to resolve to a human-readable spec, though, for the same discoverability as RDFa. It's probably an oversight; I've asked Hixie to clarify.
Philip, could you give us an update on what the WHATWG sees as the publishing process for Microdata vocabularies? For example, if Wikipedia wanted to start expressing royal bloodlines using a vocabulary specific to Wikipedia, how would they go about getting that vocabulary into the HTML5 Microdata specification?
We don't have to. See the spec:
"The item type must be a type defined in an applicable specification." http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#i...
"Applicable specification" links to
"When vendor-neutral extensions to this specification are needed, either this specification can be updated accordingly, or an extension specification can be written that overrides the requirements in this specification. When someone applying this specification to their activities decides that they will recognise the requirements of such an extension specification, it becomes an applicable specification for the purposes of conformance requirements in this specification." http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.h...
Anyone can write their own extension specification -- it becomes "applicable" as soon as anyone decides to use it.
It's like saying that programming in Python is more error prone than programming in PHP - it depends entirely on the skill of the developer, what you're doing, and many other factors that are out of the hands of language designers.
I think you'll find most MediaWiki developers strongly agree that PHP is a terrible language and Python is way better, so maybe that was a bad analogy. :)
Besides, the Wikipedia community has done a fantastic job of generating valid XHTML:
Well, rather, MediaWiki has done a good job there, despite all attempts by the community. ;) Community inputs tag soup, MediaWiki converts to valid XHTML. But that's purely syntactic. You can tell from the extensive usage of tables that Wikipedians don't care about standards or theoretical purity, they just try to get things to work right. That has to be taken into account.
On Sat, Jan 16, 2010 at 5:39 PM, Platonides Platonides@gmail.com wrote:
Perhaps we shouldn't provide the full power of RDF or Microdata yet, and instead provide an extension able to handle a subset, using one or the other.
What sort of user-visible syntax would you suggest? We'd still have to use either RDFa or microdata for the actual output, so it doesn't save us much.
On Sat, Jan 16, 2010 at 7:09 PM, Happy-melon happy-melon@live.com wrote:
I know sod all about either of them except what has been posted here, but I see that they're incredibly similar, yet just different enough to be incompatible; and I see that they are both horribly difficult for the lay editor to use. By that I mean that the discussion of "oh, this one only requires us to put in two new attributes instead of three" misses the elephant in the room: *both* formats require us to whitelist new attributes and start filling our wikitext with the raw HTML that the most iconic piece of wikimarkup, the double brackets, have kept hidden for nine years.
I don't think microdata is harder to use than HTML generally. It's sure a lot easier to use than wikitext template syntax (look at some of those enwiki monstrosities).
and b) even the most careful implementation is going to manifest itself in article wikitext along the lines of "{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a {{occupation|football player}} for {{organisation|Puddlemere United}}". Or something like that.
No, I don't think we'd do that at all. We'd add microdata (or RDFa) to things like license templates, and maybe infobox templates. So this would all be hidden behind templates people are already using anyway. The goal is immediately useful metadata like licenses -- we want web crawlers to be able to automatically tell what licenses images are under, say. Abstract stuff like you're marking up shouldn't be provided with the HTML output, and should be input as part of infoboxes (since people do that anyway).
There seem to be two use cases for these systems. First, marking up the 'stuff' that MediaWiki serves: images, copyright links, author links, etc. That requires MW to be able to get hold of the raw data for, for instance, an image license; and that's begging for things like new magic words to put on the image description page, not for enabling either format directly in wikitext. The only reason to do *that*, is to support editors marking up *their own stuff*, and that's where we have problems.
I don't follow. Why can't you just alter {{cc-by-2.0}} or whatever on Commons so it outputs the right markup? MediaWiki doesn't have to do anything beyond allowing the markup to begin with.
TLDR version: jumping on either bandwagon is neither necessary nor sensible, and we should avoid getting drawn into the issue.
I would agree, except that we have an immediate potential use: marking up image licenses so image crawlers know how the images are licensed. Google already hardcodes Wikipedia licenses, apparently, but we should use standards-based machine-readable markup for the benefit of all the other MediaWikis, and any Wikimedia wikis they haven't hardcoded, and Commons too if they change a template name or something and break the scraping, etc. This is why Duesentrieb added the feature. Unless we all agree it's not worth getting into this for the sake of that use-case, we do have to address the issue now.
On Sat, Jan 16, 2010 at 7:13 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Just to be clear - I'm not trying to propose that wikipedia editors should start writing wiki markup interleaved with RDFa/Microdata. Quite the opposite - I think that allowing contributors to hand author RDFa or Microdata would be a very bad idea for Wikipedia. However, it seems like what you are saying is that interleaving HTML like this is not possible anyway - which is a good thing, IMHO.
HTML can be interleaved with wikitext. This is needed because all templates are written in wikitext, for instance. Templates are just chunks of wikitext that can get included in other pages, optionally with some predefined parameters substituted with strings of yet more wikitext. So MediaWiki recursively substitutes all templates (along with other things like conditional constructs) with their wikitext output before evaluating the whole resulting mess as a single wikitext string.
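As a toy model of that expansion process (my own sketch; MediaWiki's actual parser handles vastly more: named parameters, parser functions, conditionals, and so on):

```python
import re

# Toy template expander: {{name|arg}} is replaced by the template body,
# with {{{1}}} substituted by the argument, recursively. The template
# store below is hypothetical.

TEMPLATES = {
    "cc-by-2.0": "Licensed under [[Creative Commons]] Attribution 2.0.",
    "bold": "'''{{{1}}}'''",
}

def expand(wikitext):
    def repl(match):
        name, _, arg = match.group(1).partition("|")
        if name in TEMPLATES:
            body = TEMPLATES[name].replace("{{{1}}}", arg)
            return expand(body)  # recurse: templates may include templates
        return match.group(0)  # unknown template: leave as-is
    return re.sub(r"\{\{([^{}]+)\}\}", repl, wikitext)
```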
Does anybody have a link to a previous discussion about how to get Wikipedia to output the same data that dbpedia.org is publishing?
As far as I can tell, dbpedia.org just has people manually sift through Wikipedia templates and translate them to RDF. Things like infoboxes naturally lend themselves to users inputting key-value pairs, which can easily be translated to RDF triples. I don't think we should use either microdata or RDFa for this kind of data-mining use-case -- it would be way too much markup and not useful to practically any viewers. People who want to data-mine can use a separate data stream, possibly RDF, possibly autogenerated by MediaWiki. Inline metadata is only ideal for things you want browsers, search engines, and other HTML consumers to see.
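For what it's worth, the key-value-to-triples mapping really is mechanical; a sketch, with a made-up "wp:" predicate namespace:

```python
# Illustrative only: infobox parameters are already (key, value) pairs,
# so producing subject/predicate/object triples is a direct mapping.
# The "wp:" namespace is invented for this sketch.

def infobox_to_triples(subject, params):
    return [(subject, "wp:" + key.replace(" ", "_"), value)
            for key, value in sorted(params.items())]

triples = infobox_to_triples("Augustus", {
    "name": "Caesar Augustus",
    "place of burial": "Mausoleum of Augustus",
})
```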
On Sat, Jan 16, 2010 at 8:25 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Microdata is also safe to use for deployment. Like other web technologies maintained by the WHATWG, it will not change once it's widely adopted, and Wikipedia adoption would probably count as wide adoption by itself. Note that microdata, like all of HTML5, is at Last Call at the WHATWG, independent of its status as Working Draft in the W3C.
I've asked Hixie how stable Microdata is. Since he's the sole person who decides on changes to HTML5 at the WHATWG, as you know, his answer should be authoritative.
[100116 20:35:42] <AryehGregor> I assume that if Wikipedia starts using it on a large scale and we do a MediaWiki release and such, though, you won't change it after that and break all our content, right? [100116 20:35:56] <Hixie> correct
So it's certainly stable enough for us to use.
Hello,
The discussion so far has been about biographical data on Wikipedia and licensing data on Commons, but other projects have their own needs for it.
Wikisource, especially, is in desperate need of metadata. We have some 140,000 pages on the English wiki alone that represent poems, chapters, tables of contents, and so forth. These are essentially disorganized: we have human-usable templates and categories, but there's really no good way to find works besides searching their titles.
A few years ago we combined our metadata templates into two standard templates, {{header}} (for works) and {{author}} (for authors). Every single page already provides metadata to these templates, so implementing a metadata format for machine use is trivial once it is available on MediaWiki. We *really* want this; it would allow us to index our jumbled pile of works and authors in all sorts of very useful and interesting ways. Just a few examples are author search and autocompletion (we currently list works manually), finding works by genre and year and subject and so forth, searching work descriptions, and distinguishing works from subpages.
Both formats have their own advantages and disadvantages. Microdata's simplicity is a significant advantage, but RDFa's built-in validation is also nice. Whichever format we choose, we'll make it all work behind the scenes in the murky depths of our templates. But it would be nice if you'd include creative works, authors, navigation, and indexes in the equation. There's more here than biographies and image licenses. :)
On Sat, Jan 16, 2010 at 9:07 PM, Jesse (Pathoschild) pathoschild@gmail.com wrote:
Wikisource, especially, is in desperate need of metadata. We have some 140,000 pages on the English wiki alone that represent poems, chapters, tables of contents, and so forth. These are essentially disorganized: we have human-usable templates and categories, but there's really no good way to find works besides searching their titles.
A few years ago we combined our metadata templates into two standard templates, {{header}} (for works) and {{author}} (for authors). Every single page already provides metadata to these templates, so implementing a metadata format for machine use is trivial once it is available on MediaWiki. We *really* want this; it would allow us to index our jumbled pile of works and authors in all sorts of very useful and interesting ways. Just a few examples are author search and autocompletion (we currently list works manually), finding works by genre and year and subject and so forth, searching work descriptions, and distinguishing works from subpages.
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use. The only use that any of this metadata stuff has to us is exposing info to *non*-Wikimedia agents. For internal use, we can make up our own custom formats and use plain old database queries much more easily than resorting to any standard format.
For instance, we have lots of images on Commons under various licenses. *We* know which license each is under, because we use MediaWiki's category system. But *other* people (e.g., search engines) also want to know what licenses our images are under. So for this we want a standard format like microdata or RDFa, so they don't have to keep track of our internal data formats.
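As a sketch of what exposing a license in a standard format might look like, here is a hypothetical generator for an RDFa-style snippet built on the standard rel="license" link relation. The URLs and wrapper markup are illustrative, not MediaWiki's actual output:

```python
def license_markup(img_url, license_url, license_name):
    """Render a minimal RDFa-style snippet asserting an image's
    license. A real template would likely also carry cc: or dc:
    properties alongside rel="license"."""
    return (f'<div about="{img_url}">'
            f'<a rel="license" href="{license_url}">{license_name}</a>'
            f'</div>')

snippet = license_markup(
    "http://commons.wikimedia.org/wiki/File:Example.jpg",
    "http://creativecommons.org/licenses/by-sa/3.0/",
    "CC BY-SA 3.0",
)
print(snippet)
```

A crawler that understands rel="license" can then attribute the license without knowing anything about MediaWiki's category system.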
What Wikisource needs here is a MediaWiki extension. Standard metadata languages are not going to help at all. If no one is willing to write an extension for it now, no one will be willing with RDF support -- since that won't make the job the slightest bit easier.
Both formats have their own advantages and disadvantages. Microdata's simplicity is a significant advantage, but RDFa's built-in validation is also nice.
Neither has more built-in validation than the other. Both allow arbitrary validation. RDFa seems to allow validation to be encoded in a more machine-readable format, but whether that's an advantage at all is debatable. HTML5 does not provide a DTD, XML Schema, or any other machine-readable language description, for good reason.
On Sat, Jan 16, 2010 at 9:37 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use. The only use that any of this metadata stuff has to us is exposing info to *non*-Wikimedia agents. For internal use, we can make up our own custom formats and use plain old database queries much more easily than resorting to any standard format. [...] For instance, we have lots of images on Commons under various licenses. *We* know which license each is under, because we use MediaWiki's category system.
Unfortunately, categories and database queries are inadequate for our needs. Someone can indeed navigate to Categories::Works::Works by genre::Non-fiction::Governmental::Biographies::Ancient biographies, and they'll find all 5 pages that someone thought to categorize to this depth. But if someone hopes to find our 1872 American biographies, they are going to be sorely disappointed.
Metadata, whether a standard or internal format, allows machines to extract this data from template output and store it in a database for human use. If you want 1872 American biographies mentioning a Willard, just fill in the year, location, and description fields, and check off the relevant genres from the database. This will return a list of actual works that match the exact criteria given, not subpages or mid-text false matches which are the best we can get now.
If we simply extend MediaWiki to support metadata for works or authors, the metadata is limited to these types and fields. Public metadata can be extended and parsed in any way the local community or our content users feel useful. Users can add their own metadata (translators? publishers? work licenses?) to templates, and add their own tools and databases to the collection.
This is also not possible with database queries, since the metadata is not provided to the software except as part of the wiki text. It's conceivable to extract it directly from the wiki text of a wiki dump, but this would be horrendously complex given the number of different options and combinations. It's possible to use an internal Wikimedia format, but this would be useless outside Wikimedia.
There is very little difference between internal and external use; it's no easier for a Wikisource editor to find those 1872 American biographies. Editors are also users. Categories are inadequate beyond the simplest one-dimensional criteria.
So, these metadata formats are definitely *not* useless for internal community use.
On Sat, Jan 16, 2010 at 10:07 PM, Jesse (Pathoschild) pathoschild@gmail.com wrote:
Unfortunately, categories and database queries are inadequate for our needs. Someone can indeed navigate to Categories::Works::Works by genre::Non-fiction::Governmental::Biographies::Ancient biographies, and they'll find all 5 pages that someone thought to categorize to this depth. But if someone hopes to find our 1872 American biographies, they are going to be sorely disappointed.
You can do this with database queries fine -- there are already several different toolserver tools that will do category intersections for you, and a couple extensions. In fact, bog-standard search will do it for you, although AFAIK only for categories added literally (not by templates):
http://en.wikipedia.org/w/index.php?title=Special:Search&redirs=1&se...
It wouldn't be that hard to allow template-added categories too. I assume you have categories like "books published in America", "books published in 1872", and "biographies" -- if not, you can easily add them via your templates (although that wouldn't work right now with standard search AFAIK, it would work with things like CatScan).
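The category-intersection approach described above amounts to set intersection over category membership. A toy sketch, with page and category names invented:

```python
# Pages in each category, keyed by category name (toy data).
categories = {
    "Books published in America": {"Life of Grant", "Moby-Dick", "Walden"},
    "Books published in 1872": {"Life of Grant", "Roughing It"},
    "Biographies": {"Life of Grant", "Eminent Victorians"},
}

def intersect(*names):
    """Return the pages belonging to every named category."""
    sets = [categories[n] for n in names]
    return set.intersection(*sets)

print(intersect("Books published in America",
                "Books published in 1872",
                "Biographies"))
```

This is essentially what tools like CatScan compute, just against the real category tables instead of an in-memory dict.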
If we simply extend MediaWiki to support metadata for works or authors, the metadata is limited to these types and fields. Public metadata can be extended and parsed in any way the local community or our content users feel useful.
Sure, but this is not internal use, so not relevant to my last post.
This is also not possible with database queries, since the metadata is not provided to the software except as part of the wiki text.
It is if you use categories. It would also be possible to hack up some tool to store all template parameter-value pairs, which are strikingly similar to RDF triples: (article, template+parameter name, parameter value).
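A minimal sketch of that idea -- flattening template parameter-value pairs into (article, template+parameter, value) triples. Article, template, and parameter names are invented for illustration:

```python
def params_to_triples(article, template, params):
    """Flatten template parameters into (subject, predicate, object)
    triples, using "template.parameter" as the predicate name."""
    return [(article, f"{template}.{key}", value)
            for key, value in params.items()]

triples = params_to_triples(
    "The Raven",
    "header",
    {"author": "Edgar Allan Poe", "year": "1845"},
)
print(triples)
```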
There is very little difference between internal and external use; it's no easier for a Wikisource editor to find those 1872 American biographies. Editors are also users.
By "internal use" I mean "use by software designed only to work with MediaWiki", not "use by Wikimedia users". Standards are only needed if we want to be useful to software that's also meant to work with other sites. That way, the software can use the same code to process both our site and the other sites, since all output the same standard markup. If the software is only processing MediaWiki sites to begin with, then standard markup is useless. (Unless it happens to expose convenient libraries, like with XML or such -- but that's probably not the case here.)
So, these metadata formats are definitely *not* useless for internal community use.
No, they really are. It's almost certainly more work for us to use a standard of any kind than to make up our own internal format, so if we only care about internal use, bothering with standards is counterproductive. The real use-cases are for external users only.
* Aryeh Gregor Simetrical+wikilist@gmail.com [Sat, 16 Jan 2010 23:06:06 -0500]:
You can do this with database queries fine -- there are already several different toolserver tools that will do category intersections for you, and a couple extensions. In fact, bog-standard search will do it for you, although AFAIK only for categories added literally (not by templates):
http://en.wikipedia.org/w/index.php?title=Special:Search&redirs=1&se...
Intersections are probably inefficient when someone needs a range search between, let's say, 1944 and 1965. SMW probably has the right approach: something sequential and numerical like a date, mass, or speed should not be a Category but a Property. Also, it's a bit sad that so many toolserver tools are standalone and not part of the MediaWiki distribution. That tool should be part of Special:Search.
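The point about range searches is that a numeric query is natural over stored property values but awkward as category intersections (one category per year, intersected 22 times). A toy illustration with invented data:

```python
# Works described by a numeric "year" property (toy data).
works = [
    {"title": "Memoir A", "year": 1944},
    {"title": "Memoir B", "year": 1950},
    {"title": "Memoir C", "year": 1970},
]

def in_range(works, lo, hi):
    """Range query over a numeric property -- awkward to express as a
    stack of per-year categories, trivial over stored values."""
    return [w["title"] for w in works if lo <= w["year"] <= hi]

print(in_range(works, 1944, 1965))
```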
It wouldn't be that hard to allow template-added categories too. I assume you have categories like "books published in America", "books published in 1872", and "biographies" -- if not, you can easily add them via your templates (although that wouldn't work right now with standard search AFAIK, it would work with things like CatScan).
When it comes to subcategories, I always wondered why they have to include the name of the parent category: http://en.wikipedia.org/wiki/Category:Books The word "Books" is repeated many times through the nested categories, although we already know these are the "Books". However, this brings up the problem of "de-parenting" categories, which is hard to resolve because the categories are part of the source text. Perhaps each category could have a full name and a shorter subcategory alias, defined on its NS_CATEGORY page. Dmitriy
Jesse (Pathoschild) <pathoschild <at> gmail.com> writes:
If we simply extend MediaWiki to support metadata for works or authors, the metadata is limited to these types and fields. Public metadata can be extended and parsed in any way the local community or our content users feel useful. Users can add their own metadata (translators? publishers? work licenses?) to templates, and add their own tools and databases to the collection.
Hi Jesse,
the use you may need seems to be a lot like what Semantic MediaWiki is offering. I don't know if Wikisource would consider it, but adding user-curated metadata using a user-generated vocabulary, and being able to query it internally (as well as exporting it externally) is pretty much what we do.
If you have any questions on it, feel free to contact me.
Cheers, denny
On Sun, Jan 17, 2010 at 9:20 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
the use you may need seems to be a lot like what Semantic MediaWiki is offering. I don't know if Wikisource would consider it, but adding user-curated metadata using a user-generated vocabulary, and being able to query it internally (as well as exporting it externally) is pretty much what we do.
The major problem with SMW in the past has been, AFAIK, that it's an enormous amount of code written totally separately from MediaWiki by different people, and would need to be reviewed in its entirety by someone like Tim Starling before it could be enabled on any Wikimedia site. I recall Tim looking briefly at the code and taking a few minutes to find an XSS exploit. There are also likely to be major performance issues scaling to Wikipedia (correct me if I'm wrong). So I wouldn't bet on any progress here anytime soon, especially since we're way behind on reviewing even existing core code, let alone large new extensions.
A much more probable method of progress would be to try committing more modest features incrementally to core, or to small special-purpose extensions. I don't think it would be very hard at all to have the API output a machine-readable summary of the template parameters used on a given page. I might do that today as a proof-of-concept. If I do, then someone familiar with RDF and PHP could probably write a fairly simple patch to turn this code into RDF output. From there it would be pretty simple to write a maintenance script to output RDF for the template parameters on all pages on a wiki, and we could see about incorporating that into the regular Wikipedia data dump.
Notably, this doesn't try to actually use the data on the wiki, so should have no scalability issues. It should also be small enough to put in core with no problems, so all MW wikis could be outputting RDF for their template parameters out of the box. My understanding is that it's expected that data providers may output RDF in whatever format is convenient to them, and someone will have to write OWL to turn this into more conventional formats. But we can output the raw data reasonably easily, at least.
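A sketch of what such a raw export might look like: template parameters serialized as N-Triples, one line per (page, property, value). The property base URI is a made-up placeholder; a real exporter would need an agreed vocabulary:

```python
def to_ntriples(page_uri, params, prop_base):
    """Emit template parameters as N-Triples lines, escaping the
    literal values minimally (backslashes and double quotes)."""
    lines = []
    for key, value in params.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{page_uri}> <{prop_base}{key}> "{escaped}" .')
    return "\n".join(lines)

nt = to_ntriples(
    "http://en.wikisource.org/wiki/The_Raven",
    {"author": "Edgar Allan Poe", "year": "1845"},
    "http://example.org/template/header#",
)
print(nt)
```

An external consumer could load output like this into any triple store and map the placeholder properties onto standard vocabularies afterwards.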
On Jan 17, 2010, at 16:11, Aryeh Gregor wrote:
On Sun, Jan 17, 2010 at 9:20 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
the use you may need seems to be a lot like what Semantic MediaWiki is offering. I don't know if Wikisource would consider it, but adding user-curated metadata using a user-generated vocabulary, and being able to query it internally (as well as exporting it externally) is pretty much what we do.
The major problem with SMW in the past has been, AFAIK, that it's an enormous amount of code written totally separately from MediaWiki by different people, and would need to be reviewed in its entirety by someone like Tim Starling before it could be enabled on any Wikimedia site. I recall Tim looking briefly at the code and taking a few minutes to find an XSS exploit. There are also likely to be major performance issues scaling to Wikipedia (correct me if I'm wrong). So I wouldn't bet on any progress here anytime soon, especially since we're way behind on reviewing even existing core code, let alone large new extensions.
I was not talking about Wikipedia -- even though our scalability tests suggest that it could work there, it is hard to say in advance without testing on the actual WMF server farm. I am merely talking about Wikisource, and wondering if it could be used to solve the problems they have, right now.
Furthermore, the code has had some peer review by now, it is used by sites like Wikia. Our code is getting smaller and we are incorporating comments. It would be great to get further reviews.
So, as said, I am only talking about Wikisource. I think it could be a viable solution for them.
Notably, this doesn't try to actually use the data on the wiki, so should have no scalability issues. It should also be small enough to put in core with no problems, so all MW wikis could be outputting RDF for their template parameters out of the box. My understanding is that it's expected that data providers may output RDF in whatever format is convenient to them, and someone will have to write OWL to turn this into more conventional formats. But we can output the raw data reasonably easily, at least.
Since for the requirements of Wikisource it seems it would be helpful for the wiki itself to store and use the data (e.g. give me all the chapters, in order, of that book written by X between 1920 and 1940), I was wondering if an extension that does that could be helpful. It is obviously entirely possible to have the metadata generated by the RDFa extension, harvested by an external tool, the queries processed by an external tool, and the result uploaded to the wiki. It may be a bit easier for Wikisource if the wiki did it, since it could potentially enable more users to perform these tasks.
In the case of Wikisource I'd further suggest switching off the additional annotation syntax of SMW and going for a mode where the templates do the whole annotation, but that again is an implementation detail that has to be decided by the Wikisource community.
Cheers, denny
On Sun, Jan 17, 2010 at 11:32 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
I was not talking about Wikipedia -- even though our scalability tests suggest that it could work there, it is hard to say in advance without testing on the actual WMF server farm. I am merely talking about Wikisource, and wondering if it could be used to solve the problems they have, right now.
The code still must undergo security review to be enabled on any Wikimedia site. As I said, we don't even have enough reviewers right now to review core code, let alone large new extensions, so it's really not likely in the near future. Even small extensions would probably have a hard time getting enabled right now.
On Sun, Jan 17, 2010 at 11:40 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
Intersections are probably inefficient when someone needs a range search between, let's say, 1944 and 1965. SMW probably has the right approach: something sequential and numerical like a date, mass, or speed should not be a Category but a Property.
Yes, that would be awkward to phrase in Lucene search. The point is, anyway, that enabling something like SMW (probably with fewer features) is orthogonal to RDFa/microdata/RDF support -- the extension could incidentally output RDF or whatnot, but it doesn't matter for internal use.
Also, it's a bit sad that so many toolserver tools are standalone and not part of the MediaWiki distribution. That tool should be part of Special:Search.
Most toolserver tool authors just don't bother applying for commit access for whatever reason. Most tools also either perform badly and/or would need to be rewritten to meet coding standards. Toolserver roots routinely have to kill processes for using up unreasonable amounts of resources.
When it comes to subcategories, I always wondered why they have to include the name of the parent category: http://en.wikipedia.org/wiki/Category:Books The word "Books" is repeated many times through the nested categories, although we already know these are the "Books".
Because categories in MediaWiki form a directed graph, not a tree. Categories don't have a unique parent. Whether this is good or bad is debatable.
Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
[...]
Also, it's a bit sad that so many toolserver tools are standalone and not part of the MediaWiki distribution. That tool should be part of Special:Search.
Most toolserver tool authors just don't bother applying for commit access for whatever reason. Most tools also either perform badly and/or would need to be rewritten to meet coding standards. Toolserver roots routinely have to kill processes for using up unreasonable amounts of resources. [...]
How many of those tools, as extensions, would stand a chance of not being disabled under $wgMiserMode? If such a small feature as the namespace filter in Special:Linksearch risks server meltdown (bug #10593), I doubt more complex searches are on the horizon.
Tim
The point is, anyway, that enabling something like SMW (probably with fewer features) is orthogonal to RDFa/microdata/RDF support -- the extension could incidentally output RDF or whatnot, but it doesn't matter for internal use.
Perhaps the right approach for us would be to have "some" syntax for providing this info, and then generate html5 microdata and/or rdfa into the rendered html, write the triples into an SMW backend store, and provide rdf/xml/n3/whatever output via the api.
there are three aspects here: specify, store, output. perhaps we should look at them separately.
-- daniel
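The specify/store/output split above can be sketched with one record and three functions. All names and formats here are illustrative, not actual MediaWiki hooks:

```python
import json

# One record, three concerns: specify (the input), store (triples in
# a backend), output (RDFa in the rendered HTML, plus a machine
# format via the API).
record = {"subject": "The_Raven", "author": "Edgar Allan Poe"}

def store(record):
    # "store": flatten into triples for a backend store
    return [(record["subject"], "author", record["author"])]

def to_rdfa(record):
    # "output", variant 1: embed into the rendered HTML
    return (f'<span about="{record["subject"]}" '
            f'property="author">{record["author"]}</span>')

def to_api(record):
    # "output", variant 2: serve via the API, e.g. as JSON
    return json.dumps(record)

print(store(record))
print(to_rdfa(record))
print(to_api(record))
```

Keeping the three concerns separate means the storage backend and the output serializations can each be swapped without touching the others.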
On 18/01/10 14:46, Daniel Kinzler wrote:
The point is, anyway, that enabling something like SMW (probably with fewer features) is orthogonal to RDFa/microdata/RDF support -- the extension could incidentally output RDF or whatnot, but it doesn't matter for internal use.
Perhaps the right approach for us would be to have "some" syntax for providing this info, and then generate html5 microdata and/or rdfa into the rendered html, write the triples into an SMW backend store, and provide rdf/xml/n3/whatever output via the api.
there are three aspects here: specify, store, output. perhaps we should look at them separately.
-- daniel
I definitely wouldn't recommend a flat triples store as the only storage representation.
Based on past experience with just such a system, while it's formally semantically equivalent to higher-level descriptions, it's definitely much harder to munge, because you have to reverse-engineer all the reification that was needed to flatten the data into triples in order to be able to see the higher-level patterns; it's much easier to just store the higher-level description in the obvious natural way, and generate the triples representation, and any other metadata output needed, from that.
-- Neil
Neil Harris schrieb:
I definitely wouldn't recommend a flat triples store as the only storage representation.
Based on past experience with just such a system, while it's formally semantically equivalent to higher-level descriptions, it's definitely much harder to munge, because you have to reverse-engineer all the reification that was needed to flatten the data into triples in order to be able to see the higher-level patterns; it's much easier to just store the higher-level description in the obvious natural way, and generate the triples representation, and any other metadata output needed, from that.
True if you know the "obvious natural way" in advance and can design a database schema for it. I don't think we can do that. We'll need a generic abstraction for storing structured (meta) data, so it can be used for all the different kinds of data we will get.
On the other hand, I see the problems with triple stores, especially wrt reification. Triples make this very clumsy, and it's something we will need once we want to map infoboxes. We need it because a lot of the statements given in infoboxes are qualified: they have a source, a unit of measurement, an error margin, a point in time or some other meta-statement attached. I don't have a good solution for this right now, but I do think we should consider it.
-- daniel
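To illustrate the reification problem: a qualified statement kept as one structured record, versus the same statement flattened into triples through an intermediate node that every consumer must reassemble. The vocabulary and data are invented:

```python
# A qualified statement -- a value with a unit and a source -- as one
# structured record.
structured = {
    "subject": "Berlin",
    "property": "population",
    "value": "3431700",
    "unit": "inhabitants",
    "source": "census 2008",
}

def flatten(stmt, node_id):
    """Flattening forces every qualifier through an extra node, which
    a consumer must reverse-engineer to recover the original record."""
    return [
        (stmt["subject"], stmt["property"], node_id),
        (node_id, "value", stmt["value"]),
        (node_id, "unit", stmt["unit"]),
        (node_id, "source", stmt["source"]),
    ]

triples = flatten(structured, "_:stmt1")
print(triples)
```

One field lookup in the structured form becomes a two-hop graph traversal in the flattened form, which is the "munging" cost Neil describes.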
Before we get into this thread too deeply, for those that are not familiar with semantic data, RDF, RDFa or why any of this stuff applies to Wikipedia, there are two very short videos that explain the concepts at a high-level (apologies, as they're a bit dated):
Intro to the Semantic Web (6 minutes) http://www.youtube.com/watch?v=OGg8A2zfWKg
RDFa Basics (9 minutes) http://www.youtube.com/watch?v=ldl0m-5zLz4
Aryeh Gregor wrote:
What we're talking about (microdata, RDFa, RDF, etc.) is categorically useless for Wikimedia-internal use.
Not necessarily. Javascript can use the RDFa on the page to generate more intuitive interfaces for the page. To give an example - we use the RDFa expressed in our music pages:
http://bitmunk.com/media/6995806
to drive the music player application via Javascript - by parsing the RDFa and feeding the sample URLs to the player.
To give a less than ideal example - Wikipedia could use data on the page to provide interactive discovery of concepts expressed on the page (such as automatically fetching and parsing RDFa on a related page to display more factual information on the current page). The gist of what I'm getting at is to not dismiss the value of having a standardized mechanism for embedded page data - you get to use it internally and externally. The more data you expose, the greater the possibility of somebody figuring out how to use the data in amazing new ways.
Aryeh Gregor wrote:
I'll emphasize from the start that I do *not* think either RDFa or microdata is suitable for dbpedia.org-style content. There's no reason we should put that in the HTML output, where it will take up tons of space and not be useful to HTML consumers (e.g., browsers and search engines).
Placing this data in your HTML documents has a direct impact on browsers and search engines. Browsers can collect triples and use them later to help you answer questions that you may have about a particular subject. Search engines can crawl the HTML and make their indexes more accurate based on semantic data that Wikipedia's pages expose.
RDF/XML, which was largely unsuccessful, was designed to be used for publishing in a dual-stream setup. It was expected that web publishers would publish semantic data beside web page data, just as you've proposed Wikipedia do, but managing both types of serialization proved far too difficult for most sites.
Wikipedia is already short on developers, creating a new data stream is just going to exacerbate the problem. Besides, the way Wikipedia seems to be capturing data is via wikitext, not direct database entries. In effect, this community's database exists in the wikitext.
Aryeh Gregor wrote:
On the other other other hand, RDFa 1.1 is under development and looks like it will make major changes, so from that perspective microdata is arguably more stable.
There are new features going into RDFa 1.1, but classifying them as "major" changes makes it sound like RDFa 1.1 isn't going to be backwards-compatible with RDFa 1.0, when it most definitely is going to be backwards-compatible (except possibly for XMLLiterals, which was our bad).
The statement that "Microdata" is more stable because there are new features going into RDFa 1.1 is illogical. For example: just because there are new features going into the next version of Apache doesn't mean that it's any less "stable" for those that are using the current version today.
Aryeh Gregor wrote:
So, it's complicated. :) But from our perspective, I don't think there's a big difference in terms of stability or standard-ness, so I skipped over all this.
There's a huge difference in both stability and standard-ness - XHTML+RDFa is a W3C REC - it's a standard. Microdata and HTML+RDFa aren't even close to becoming a W3C REC. That's very important information for this community to consider.
When do you think that Microdata is going to be a REC at the W3C?
There were changes to the Microdata spec made by Ian less than 12 hours ago (January 18th 2010). If a spec is being actively edited, I don't think it's a good idea to say that it's stable and ready for deployment:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-January/024760.html
You are skipping over some pretty important stuff, kemosabe. :)
Aryeh Gregor wrote:
so converting the microdata graph to RDFa might be easier than the reverse.
Microdata's underlying model is triples as well - Microdata allows the limited expression of RDF. Since RDFa supports the expression of RDF more formally, you can map Microdata to RDFa more easily than you can map RDFa to Microdata (for some value of "easier").
You cannot, however, express RDF fully in Microdata - it is impossible in cases where it matters to Wikipedia (like data-typing).
Microdata doesn't support data typing (via @datatype), doesn't support data value overriding (via @content), doesn't support URI short-handing via CURIEs (via @xmlns:PREFIX), and doesn't support anonymous subjects via bnodes (blank nodes). The @datatype, @content and CURIE omissions affect Wikipedia; not supporting bnodes doesn't necessarily impact the Wikipedia community, AFAICT.
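For readers unfamiliar with CURIEs: they are compact URIs expanded against a prefix map, which is the URL shorthand Microdata lacks. A minimal sketch; the prefix URIs are the ones commonly used for these vocabularies:

```python
# A CURIE like "dc:title" expands against a prefix map to a full URI.
prefixes = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand_curie(curie, prefixes):
    prefix, sep, reference = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return curie  # no known prefix; leave as-is

print(expand_curie("dc:title", prefixes))
print(expand_curie("foaf:name", prefixes))
```

Without this mechanism, every property in the page must carry its full vocabulary URL, which is the repetition complaint raised later in the thread.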
Aryeh Gregor wrote:
I also think microdata is much easier to author for people with an HTML (not RDF) background -- template editors tend to have a good working knowledge of HTML, but not web-data technologies. I'd be interested in what Manu (or other RDFa supporters) has to say here.
I do think that Microdata has that going for it - in that property names such as @itemref, @itemprop, etc. are easier to understand than @about, @datatype, @rel/@rev, and @content.
I'm all for making it easier for web authors to write this stuff, so the consistency of the itemXYZ attributes in Microdata was a good move. We didn't choose to do that for RDFa because we wanted to make the mapping from HTML to RDF explicit. The down-side with that is it requires authors to either have their RDFa autogenerated for them (which is the best thing for RDFa and Microdata), or it requires them to sit through a 10 minute tutorial on RDF (like the video at the top of this e-mail).
I do also think that Microdata has made several really big mistakes that we made in the Microformats community that were corrected in the RDFa community. Namely, not using CURIEs and adding the requirement that all URLs are repeated as many times as they're used. It's fine as an option, but not that great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it is using Microdata.
http://rdfa.info/wiki/Developer-faq#Authoring
The FAQ above, which is a work in progress, is a good introduction to some of the common criticisms against RDFa and the reasoning behind the design decisions, for those that are interested.
The FAQ also addresses the fallacy that RDFa markup is, for real-world data, more verbose than Microdata markup.
Aryeh Gregor wrote:
Neither has more built-in validation than the other. Both allow arbitrary validation. RDFa seems to allow validation to be encoded in a more machine-readable format, but whether that's an advantage at all is debatable.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. In order to validate Microdata, you must first convert it to RDF, and even if you do, it will fail attempts to validate the literals that should have a datatype. If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
-- manu
[1]http://krijnhoetmer.nl/irc-logs/whatwg/20100118#l-219
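The generic, data-driven validation being argued for here can be sketched as a single validator that takes datatype rules as data, rather than one hard-coded validator per vocabulary. The regexes are deliberately simplified; real XSD datatype checking is stricter:

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Datatype rules supplied as data: one validator covers every
# vocabulary that reuses these datatypes. (Simplified regexes; real
# XSD datatype checking is stricter.)
VALIDATORS = {
    XSD + "integer": re.compile(r"^[+-]?\d+$"),
    XSD + "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def literal_is_valid(value, datatype):
    pattern = VALIDATORS.get(datatype)
    if pattern is None:
        return True  # unknown datatype: nothing to check
    return bool(pattern.match(value))

print(literal_is_valid("1872", XSD + "integer"))        # True
print(literal_is_valid("about 1872", XSD + "integer"))  # False
print(literal_is_valid("1872-01-15", XSD + "date"))     # True
```

Adding a new vocabulary here means adding data (more datatype entries), not writing a new validator, which is the contrast with the per-vocabulary approach.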
On Mon, Jan 18, 2010 at 5:34 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Not necessarily. Javascript can use the RDFa on the page to generate more intuitive interfaces for the page.
Sure, but if we're providing the JavaScript, we could do it without RDFa just as well. Or can you provide a specific case where you think it would be easier for MediaWiki to implement some feature via RDFa (or microdata) than via any other means, not counting communication with outside software? Such cases might exist (like if there's a library to do it that already happens to use RDFa), but they'd be hard to find and debatable at best, I suspect.
Placing this data in your HTML documents has a direct impact on browsers and search engines. Browsers can collect triples and use them later to help you answer questions that you may have about a particular subject. Search engines can crawl the HTML and make their indexes more accurate based on semantic data that Wikipedia's pages expose.
*Can*. Yes, in theory. But do they? Will they? If not, then it's probably not worth the effort to put much work into it so speculatively, especially if it increases the complexity of editing. On the other hand, if they do implement feature X if you provide in-page metadata, would they be equally willing to use a separate RDF stream?
RDF/XML, which was largely unsuccessful, was designed to be used for publishing in a dual-stream setup. It was expected that web publishers would publish semantic data beside web page data, just as you've proposed Wikipedia do, but managing both types of serialization proved far too difficult for most sites.
Is it managing two serializations that was the problem? Or just that most sites aren't willing to encode data in the hope that some consumer somewhere might use it for something in the future? Personally, I don't think it would be hard at all to maintain multiple data streams. The content is all script-generated anyway. We already have multiple ways to access the same data or subsets thereof in various formats, like:
http://en.wikipedia.org/wiki/RDFa http://en.wikipedia.org/wiki/RDFa?action=raw http://en.wikipedia.org/w/api.php?action=query&prop=categories&title... http://en.wikipedia.org/w/api.php?action=query&prop=extlinks&titles=... http://en.wikipedia.org/w/api.php?action=query&prop=templates&titles...
and many others. You can append &format=xml to the API queries to get them in proper XML, or &format=json for JSON, php for PHP array syntax, yaml for YAML, txt for plaintext, etc. It would be pretty simple to write a new API module or query prop or whatever that would retrieve any type of data from the wikitext of the page and format it as RDF or whatever else you liked.
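To make that concrete, here is a minimal sketch of what such an export could look like. It converts a sample of the JSON that prop=categories returns into Turtle triples; in practice the JSON would come from an HTTP request to api.php, and the choice of Dublin Core's dc:subject as the mapping is purely illustrative, not something the API defines.

```python
import json

# A sample of the JSON that api.php?action=query&prop=categories&format=json
# returns; in a real module this would come from an HTTP request to the API.
api_response = json.loads("""
{"query": {"pages": {"25458": {"title": "RDFa",
    "categories": [{"title": "Category:Semantic Web"},
                   {"title": "Category:World Wide Web Consortium standards"}]}}}}
""")

def to_turtle(response):
    """Emit one dc:subject triple per category, in Turtle syntax.
    The Dublin Core mapping here is illustrative, not prescriptive."""
    lines = ["@prefix dc: <http://purl.org/dc/terms/> ."]
    for page in response["query"]["pages"].values():
        subject = "<http://en.wikipedia.org/wiki/%s>" % page["title"]
        for cat in page.get("categories", []):
            # Strip the "Category:" namespace prefix from the title.
            name = cat["title"].split(":", 1)[1]
            lines.append('%s dc:subject "%s" .' % (subject, name))
    return "\n".join(lines)

print(to_turtle(api_response))
```

The point is that the transformation is a few lines of glue over data the API already serves, not a separate data stream to maintain.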
Wikipedia is already short on developers; creating a new data stream would only exacerbate the problem.
No, it would be pretty simple, in my opinion as a MediaWiki developer.
There are new features going into RDFa 1.1, but classifying them as "major" changes makes it sound like RDFa 1.1 isn't going to be backwards-compatible with RDFa 1.0, when it most definitely is going to be backwards-compatible (except possibly for XMLLiterals, which was our bad).
I apologize if I inadvertently misrepresented the status of RDFa 1.1. I'm not familiar with RDFa, as I said.
There's a huge difference in both stability and standard-ness - XHTML+RDFa is a W3C REC - it's a standard. Microdata and HTML+RDFa aren't even close to becoming a W3C REC. That's very important information for this community to consider.
When do you think that Microdata is going to be a REC at the W3C?
I don't really care about formal status at the W3C. I care about providing useful features to users of Wikipedia and other MediaWiki wikis. Both RDFa and microdata are stable and usable enough right now that I think it's appropriate to evaluate them on their technical merits, not their theoretical spec status. We use plenty of things that aren't specified by any conventional standards body, like rel="canonical", OpenSearch, RSS, and so on. As long as they're well-specified de facto standards, it doesn't really matter who specifies them or what that group labels them -- why should it?
There were changes to the Microdata spec made by Ian less than 12 hours ago (January 18th 2010). If a spec is being actively edited, I don't think it's a good idea to say that it's stable and ready for deployment:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-January/024760.html
I don't see why not, as long as the editor is committed to avoiding backward-incompatible changes if possible. In the unlikely event something major comes up and there is such a change, it's not the end of the world -- we can deal with it when it comes up.
Microdata doesn't support data typing (via @datatype),
More precisely, it leaves it up to each vocabulary to determine how to handle data typing.
data value overriding (via @content),
<meta itemprop="foo" content="bar">?
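For what it's worth, microdata's meta/@content mechanism is also straightforward to consume. A rough sketch with Python's standard html.parser, where the property names ("title", "duration") are made up for illustration:

```python
from html.parser import HTMLParser

class ItemPropCollector(HTMLParser):
    """Collect itemprop name/value pairs, honouring <meta content=...>
    as the machine-readable override of the displayed text."""
    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" not in attrs:
            return
        if tag == "meta":
            # The @content value is the property value; nothing is displayed.
            self.props[attrs["itemprop"]] = attrs.get("content", "")
        else:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current:
            self.props[self._current] = data.strip()
            self._current = None

# "duration" carries a machine-readable ISO 8601 value in @content while
# the visible text stays human-readable (property names are hypothetical).
html = ('<span itemprop="title">Cool Song</span>'
        '<meta itemprop="duration" content="PT4M13S">')
p = ItemPropCollector()
p.feed(html)
print(p.props)
```

So the same "displayed value differs from machine value" pattern that RDFa's @content covers is available in microdata, just spelled differently.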
doesn't support URI short-handing via CURIEs (via @xmlns:PREFIX),
It doesn't require URIs to be used for anything except one itemtype per item, so this isn't a big deal if you only have a few items of any given type per page (which would usually be the case for, e.g., image licenses).
and it doesn't support anonymous subjects via bnodes (blank nodes).
I'm not sure what this even means. :)
I do also think that Microdata has made several really big mistakes that we made in the Microformats community that were corrected in the RDFa community. Namely, not using CURIEs and adding the requirement that all URLs are repeated as many times as they're used. It's fine as an option, but not that great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it is using Microdata.
Not if we only use it for a few things, like image licenses. Those are only displayed on the image description page, so it would be once per page in that case. I don't propose we use it for anything where we'd have fifty items per page.
RDFa seems longer even if you don't count the xmlns: stuff, anyway. Above, I found that a microdata example added 145 characters to the base markup, while equivalent RDFa (with xmlns:) added 305 characters. If you remove the two xmlns: declarations, I count only 86 characters saved, so RDFa still adds 219 characters, 50% more than microdata. So at best, dropping the xmlns: declarations saves RDFa some space, but microdata is still significantly shorter, at least for this example.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. In order to validate Microdata, you must first convert it to RDF, and even if you do, it will fail attempts to validate the literals that should have a datatype. If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
I think you agreed with what I said. Both microdata and RDFa allow validation. RDFa allows some validation constraints to be expressed in a standard form, so they can be checked by generic RDFa validators. Microdata does not.
But it's not clear to me that this is a disadvantage in practice. Presumably anything that actually uses the data will necessarily be smart enough anyway to discard invalid data at no extra cost, so why not just do it at that stage? Or, if you're using a very small set of vocabularies as I propose MediaWiki does, you can assume that validators will exist for them anyway.
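To illustrate that point about validating at the consuming stage: a consumer that only cares about a handful of properties can check each value as it ingests the triples, regardless of whether they came from microdata or RDFa. A rough sketch under that assumption; the property names and validators here are invented for illustration:

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical per-property validators; a real consumer would have one
# for each property it actually uses, whatever the source syntax was.
def is_url(value):
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def is_iso_date(value):
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

VALIDATORS = {"license": is_url, "date": is_iso_date}

def ingest(triples):
    """Keep only (subject, property, value) triples whose value passes
    that property's validator; unknown properties are kept as-is."""
    return [t for t in triples
            if VALIDATORS.get(t[1], lambda v: True)(t[2])]

triples = [
    ("Image:Foo.jpg", "license", "http://creativecommons.org/licenses/by-sa/3.0/"),
    ("Image:Foo.jpg", "date", "not-a-date"),
    ("Image:Foo.jpg", "date", "2010-01-18"),
]
print(ingest(triples))  # the "not-a-date" triple is dropped
```

The validation logic lives with the consumer either way; the only question is whether the constraints are also declared in a standard, machine-readable form.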
On Mon, Jan 18, 2010 at 23:34, Manu Sporny msporny@digitalbazaar.com wrote:
You cannot, however, express RDF fully in Microdata - it is impossible in cases where it matters to Wikipedia (like data-typing).
I'm not a Wikipedia developer or particularly active editor, but it sounds quite doubtful that XML Schema Datatypes matters to Wikipedia. Perhaps I haven't understood RDFa, but surely the vocabulary must define the datatype? If not, is @datatype a mandatory attribute that just adds dead weight all over the place? And if vocabularies do define the datatypes, why do you need to override them?
I do also think that Microdata has made several really big mistakes that we made in the Microformats community that were corrected in the RDFa community. Namely, not using CURIEs and adding the requirement that all URLs are repeated as many times as they're used. It's fine as an option, but not that great if one has to repeat a URL 50 times in a web page... which Wikipedia will eventually have to do if it is using Microdata.
There are other solutions to the "URLs are long" problem than prefix schemes. Incidentally http://n.whatwg.org/work is rather short, and I hope future vocabularies will have the good taste to use even shorter URLs.
That's provably false. Microdata vocabulary validation is hard-coded in the specification. Dan Brickley and Ian Hickson had an IRC conversation about just this today[1]. In order to validate Microdata, you must first convert it to RDF, and even if you do, it will fail attempts to validate the literals that should have a datatype.
Is the only kind of validation that RDF provides a check that something is the same kind of data it claims to be? That sounds similar to, and about as unhelpful as, doctypes. What if the author doesn't set the datatype?
If you want a Microdata vocabulary validator, you have to create one for each vocabulary... just like we had to do in the Microformats community, which some of us now recognize as a catastrophic mistake.
RDFa, via RDF, allows arbitrary data validation - one validator with any number of vocabularies. Microdata does not allow arbitrary validation - there must be one hard-coded validator per vocabulary.
What are the exact mechanisms here? Does an RDFa validator dereference all predicates and try to get an RDF Schema to validate against? Doesn't that overwhelm any web server which hosts schemas for popular vocabularies (like with W3C doctypes)? On the other hand, if only the document itself is used, what kind of validation can be meaningful?
In any case, validators for microdata is something to be worked on, but I don't think either dereferencing vocabulary URLs or an official schema language is likely to be part of the solution (the latter because you need a full programming language to validate certain types of data, not just grammar rules).
wikitech-l@lists.wikimedia.org