Aryeh Gregor wrote:
RDFa is a way to embed data in HTML more robustly than with attributes like class and title, which are reserved for author use or have existing functionality. It allows you to specify an external vocabulary that adds some semantics to your page that HTML is not capable of expressing by itself.
More to the point, it allows an RDF graph to be overlaid onto an XHTML document so that the XHTML document and the RDF graph can share some strings. The XHTML data model isn't extended per se. Instead, a separate RDF graph can be extracted.
Both RDFa+HTML and Microdata are Working Drafts at the W3C right now
It's true that both HTML+RDFa and Microdata have been published in Working Drafts at the W3C. However, Microdata has never been through a Working Group Decision to publish as a First Public Working Draft while HTML+RDFa has. Microdata was added to a Working Draft after FPWD and there has since been a Working Group decision to take Microdata out of that spec.
It is reasonable to expect that soon HTML+RDFa and Microdata could be in the same stage Process-wise, but it's inaccurate to portray them as being at the same stage Process-wise right now.
I should note that currently Google and a couple of others support RDFa but not Microdata.
See http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2009Sep/0126.html (search for the word "deviate").
Manu Sporny wrote:
The general points that you made were riddled with technical inaccuracies, bad advice, and if implemented by the MediaWiki community, would have resulted in semantic data that would have been ambiguous at best and erroneous at worst.
With that introduction, I think it's fair to evaluate your message for inaccuracies or relevant omissions as well.
The above could be marked up in RDFa, with pre-defined vocabs, like so:
It should be noted that the concept of "pre-defined vocabs" is neither in the HTML+RDFa draft nor in the RDFa in XHTML spec from the XHTML2 WG.
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage">
<span property="dc:title">Emery Molyneux Terrestrial Globe</span> by
<a rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</a>
is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
Hiding the CURIE declarations is a common pattern when advocating RDFa: It makes RDFa appear tidier than it is. To write this in RDFa in XHTML (the RDFa spec you say is safe to use for deployment), one would need to declare the CURIE prefixes:
<p xmlns:dctype="http://purl.org/dc/dcmitype/" about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage">
<span xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">Emery Molyneux Terrestrial Globe</span> by
<a xmlns:cc="http://creativecommons.org/ns#" rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</a>
is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
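To make the role of the prefix declarations concrete, here is a minimal Python sketch of CURIE expansion. It is only an illustration: the function name `expand_curie` and the `PREFIXES` table are hypothetical, and a real RDFa processor resolves prefixes from the in-scope xmlns declarations rather than from a fixed table. Unprefixed terms such as rel="license" are not handled here.

```python
# Hypothetical prefix table mirroring the xmlns declarations in the example.
PREFIXES = {
    "dctype": "http://purl.org/dc/dcmitype/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://creativecommons.org/ns#",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Return the full URI for a prefixed name; raise if the prefix is undeclared."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"undeclared CURIE prefix in {curie!r}")
    return prefixes[prefix] + reference


if __name__ == "__main__":
    for name in ("dctype:StillImage", "dc:title", "cc:attributionName"):
        print(name, "->", expand_curie(name))
```

Without the declarations, a consumer has no way to know which URI "dc:title" stands for, which is why leaving them out of an example understates the markup that is actually needed.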
Philip Jägenstedt already covered other points about the examples.
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use it for deployment.
RDFa in XHTML has indeed been published as a Recommendation jointly by the Semantic Web Deployment Working Group and the XHTML2 Working Group. However, you fail to mention that even though the document mentions "HTML" in its first sentence, all the normative matter concerns strictly XHTML and the document has gone through the W3C Process as a specification that applies to XML.
MediaWiki uses the text/html and, thus, its pages get processed as HTML, so it would be inappropriate to rely on a spec that had been reviewed as an XML spec.
I think it's misleading to promote text/html deployment of specs whose normative matter has been written and reviewed for XML. The most egregious example of this is that the XHTML2 WG has written the normative matter of XHTML 1.x specs for XML but then published a Working Group Note (Notes can be pretty much anything and don't go through the W3C Recommendation track Process) that gives advice on deployment as text/html (http://www.w3.org/TR/xhtml-media-types/).
Furthermore, the ease of getting a spec to REC at the W3C depends on how many people are interested in the spec. The more people are interested in a spec, the more review comments there are. The flip side is that when there's *less* interest in a spec, it's easier to get it to Recommendation due to fewer comments raised. Thus, progress along the REC track isn't a commensurable indicator of technical merit or technical maturity across different specs and WGs.
Also, when assessing the "safe" deployability of RDFa in XHTML, it's relevant to consider that 1) RDFa in XHTML was knowingly (see http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-August/015913.html) progressed on the Recommendation track without resolving how RDFa works with HTML first. 2) An RDFa 1.1 is in the works, and the changes being considered make RDFa 1.0 look like a beta release. (Which is understandable, since a good part of the technical review of RDFa has occurred after RDFa in XHTML was rushed to REC.)
Since both RDFa and Microdata support the same underlying data model, and it's likely to take some time to resolve which will be the eventual winner, perhaps we should decouple the generation of the final HTML output from the markup of semantic text in articles.
Since it makes no sense to implement yet another incompatible "semantic wikitext" format for internal use, we will probably end up using something that is pretty close to one or the other, buried inside templates, to perform the actual in-wiki markup. Given this, is it worth considering which is easier for template authors to write, and which is easier to convert to the other -- RDFa to microdata or vice-versa?
-- Neil
On Mon, Jan 18, 2010 at 7:47 AM, Henri Sivonen hsivonen@iki.fi wrote:
It's true that both HTML+RDFa and Microdata have been published in Working Drafts at the W3C. However, Microdata has never been through a Working Group Decision to publish as a First Public Working Draft while HTML+RDFa has. Microdata was added to a Working Draft after FPWD and there has since been a Working Group decision to take Microdata out of that spec.
It is reasonable to expect that soon HTML+RDFa and Microdata could be in the same stage Process-wise, but it's inaccurate to portray them as being at the same stage Process-wise right now.
I simplified a bit, yes. The current Working Draft of HTML contains microdata, but the next one won't, but it seems almost certain that there will be a microdata FPWD published concurrently with the next HTML WD. So they're about the same -- both currently at WD, both almost certain to still be in WD at the next publication (microdata a bit less certain, but not much IMO).
On the other hand, RDFa+XHTML is a REC at the W3C, and currently we do output well-formed valid XHTML 1.0, albeit with a text/html MIME type. On the other other hand, microdata is at Last Call at the WHATWG, and we have an assurance of stability from its editor, independent of its status at the W3C. On the other other other hand, RDFa 1.1 is under development and looks like it will make major changes, so from that perspective microdata is arguably more stable.
So, it's complicated. :) But from our perspective, I don't think there's a big difference in terms of stability or standard-ness, so I skipped over all this.
On Mon, Jan 18, 2010 at 8:52 AM, Neil Harris neil@tonal.clara.co.uk wrote:
Since both RDFa and Microdata support the same underlying data model, and it's likely to take some time to resolve which will be the eventual winner, perhaps we should decouple the generation of the final HTML output from the markup of semantic text in articles.
Since it makes no sense to implement yet another incompatible "semantic wikitext" format for internal use, we will probably end up using something that is pretty close to one or the other, buried inside templates, to perform the actual in-wiki markup. Given this, is it worth considering which is easier for template authors to write, and which is easier to convert to the other -- RDFa to microdata or vice-versa?
AFAIK, Microdata is slightly less expressive than RDFa (it can't express cycles or something like that -- maybe someone else could clarify?), so converting the microdata graph to RDFa might be easier than the reverse. I also think microdata is much easier to author for people with an HTML (not RDF) background -- template editors tend to have a good working knowledge of HTML, but not web-data technologies. I'd be interested in what Manu (or other RDFa supporters) has to say here.
On Mon, Jan 18, 2010 at 9:46 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Perhaps the right approach for us would be to have "some" syntax for providing this info, and then generating html5 microdata and/or rdfa into the rendered html, write the triple into a smw backend store, and provide rdf/xml/n3/whatever output via the api.
there are three aspects here: specify, store, output. perhaps we should look at them separately.
There are two separate things we want to do here, IMO:
1) Output a very few pieces of metadata that would be useful to HTML consumers, like license metadata. For these, we should use microdata or RDFa, maybe just with one or two vocabularies whitelisted, and it would be simplest to just let people type it into templates via wikitext. I'm pretty certain about this.
2) Output more generic metadata extracted from infoboxes and such. For this, I think we should use a separate RDF stream. I don't think we need to do conversion here -- we should be able to just publish template parameter triples as RDF, and let consumers convert it to something conventional via OWL or things like that. I'm less certain about this, because I know less about web data technology. I'd want to look more deeply into dbpedia or such before trying to solve this. We don't want to use RDFa or microdata here, that I'm quite sure of. I also don't think we need any kind of in-band semantics here, we should be able to just use template parameters so template authors don't have to be bothered.
So for license data specifically, I think our current best option is to use microdata for the wikitext input, and microdata for the HTML output. On the input side, this is
* Simpler for template editors to author.
* More conveniently tailored to our precise use-case (at least microdata license vocabulary vs. the RDFa vocabs used so far, more convenient RDFa vocabs might exist).
On the output side, microdata
* Uses fewer bytes.
* Validates better (I think -- there's an HTML5+microdata validator, but I know of no HTML5+RDFa validator).
* Looks like it will have better support in browsers (e.g., Opera might be interested in exposing license metadata through a GUI).
We can always add new input formats or switch the output format later if we have good reason, though. Especially if we keep input restricted to one or two vocabularies -- or three, which for microdata is all of them right now. :)
* Aryeh Gregor Simetrical+wikilist@gmail.com [Mon, 18 Jan 2010 11:57:53 -0500]:
1) Output a very few pieces of metadata that would be useful to HTML consumers, like license metadata. For these, we should use microdata or RDFa, maybe just with one or two vocabularies whitelisted, and it would be simplest to just let people type it into templates via wikitext. I'm pretty certain about this.
There were comparisons of how many bytes the semantic data definition would take in RDFa or microdata, but it certainly takes far fewer bytes to define the properties in SMW. If SMW itself is not suitable, why not at least borrow its compact [[::]] property definition syntax? Then it would be possible to generate separate output in any desired format. Why are long templates with XML tags better?
Dmitriy
I'd like to second this. Many third party wikis use templates and content from enwiki, commons, etc. and many of these third party wikis also use SMW. It would be really nice to have this be compatible.
Respectfully,
Ryan Lane
On Mon, Jan 18, 2010 at 3:14 PM, Dmitriy Sintsov questpc@rambler.ru wrote:
There were comparisons of how many bytes the semantic data definition would take in RDFa or microdata, but it certainly takes far fewer bytes to define the properties in SMW. If SMW itself is not suitable, why not at least borrow its compact [[::]] property definition syntax? Then it would be possible to generate separate output in any desired format. Why are long templates with XML tags better?
What would be sample SMW markup for the image example from my first post here? I can see two major problems with using SMW syntax for input:
1) Implementing microdata-output-as-microdata is trivial, and already implemented in core. SMW syntax would still need to be implemented in core, and this would probably be nontrivial.
2) We typically set a much higher bar for new wikisyntax than for new whitelisted HTML attributes. By our regular standards, I don't see how it would be justifiable to introduce new syntax for the sake of probably a tiny number of templates.
Maybe your syntax would be better here, but I don't really see the benefits except to those who are already using SMW.
* Aryeh Gregor Simetrical+wikilist@gmail.com [Mon, 18 Jan 2010 17:36:23 -0500]:
What would be sample SMW markup for the image example from my first post here?
Something like this: [[Property_name::Property_value]]. I think the "third" (actually the first) value in the triple will be the page where the property is defined. Also, the output of the property value can be piped, much as with links.
[[work::http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg]] [[title::Emery Molyneux Terrestrial Globe]] [[author::Bob Smith]] [[license::http://creativecommons.org/licenses/by-sa/3.0/us/]]
The pages with the corresponding property names (work, author, etc.) in the NS_PROPERTY namespace will define the semantic types of the data.
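As a rough illustration of how little machinery this annotation syntax needs, here is a minimal Python sketch that pulls [[Property::Value]] pairs out of wikitext and turns them into triples. The names `ANNOTATION` and `extract_triples` are hypothetical, this is not SMW's actual parser, and using the page title as the subject is an assumption.

```python
import re

# Matches [[Property::Value]] and the piped form [[Property::Value|label]].
# Ordinary links such as [[Category:Foo]] do not match (single colon).
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_triples(page_title, wikitext):
    """Return (subject, property, value) triples; the subject is assumed to be the page."""
    return [
        (page_title, m.group(1).strip(), m.group(2).strip())
        for m in ANNOTATION.finditer(wikitext)
    ]


if __name__ == "__main__":
    text = "[[title::Emery Molyneux Terrestrial Globe]] by [[author::Bob Smith]]"
    for triple in extract_triples("File:Example.jpg", text):
        print(triple)
```

Values may themselves contain colons (URLs, File: names), which is why only the property name excludes them in the pattern.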
Someone stated that square-bracket syntax is harder to parse, favoring magic words / functions. Perhaps such definitions can be replaced by semantic magic words {{PROP|Name|Value}}, implemented for the Parser. Then SMW itself can pick up that curly-brace syntax. The advantage is that dumps of Wikipedia will be suitable for local SMW querying, even if SMW is not installed on Wikimedia servers.
I can see two major problems with using SMW syntax for input:
1) Implementing microdata-output-as-microdata is trivial, and already implemented in core. SMW syntax would still need to be implemented in core, and this would probably be nontrivial.
Just a minimal usable subset of syntax, without the semantic queries, "concepts" and other harder stuff.
2) We typically set a much higher bar for new wikisyntax than for new whitelisted HTML attributes. By our regular standards, I don't see how it would be justifiable to introduce new syntax for the sake of probably a tiny number of templates.
Semantic data is potentially everywhere in Wikipedia pages.
Maybe your syntax would be better here, but I don't really see the benefits except to those who are already using SMW.
The MW Parser may introduce a better-integrated syntax with magic words, which can later be adapted by SMW for its extended querying abilities. Just an idea.
Dmitriy
* Dmitriy Sintsov questpc@rambler.ru [Tue, 19 Jan 2010 10:40:16 +0300]:
Probably even shorter:
[[Work::File:EmeryMolyneux-terrestrialglobe-1592-20061127.jpg]] [[Title::Emery Molyneux Terrestrial Globe]] [[Author::Bob Smith]] [[License::CC-BY-SA-3.0]]
(as Happy-melon suggested).
That way the output of RDFa/microdata can be managed in the Parser. The properties belong only to the page where they are defined and are not stored in the DB. It would probably still require NS_PROPERTY. It should not be hard to implement (one can easily switch to curly-brace syntax). I think of it as a very limited subset of SMW (no database backend and no queries). If someone needs to have semantic data stored in the DB, they can import wiki dumps into SMW, then run the maintenance semantic data refresh script (already included in SMW).
Dmitriy
Dmitriy Sintsov schrieb:
* Dmitriy Sintsov questpc@rambler.ru [Tue, 19 Jan 2010 10:40:16 +0300]:
Probably even shorter:
[[Work::File:EmeryMolyneux-terrestrialglobe-1592-20061127.jpg]] [[Title::Emery Molyneux Terrestrial Globe]] [[Author::Bob Smith]] [[License::CC-BY-SA-3.0]]
(as Happy-melon suggested).
That way the output of RDFa/microdata can be managed in the Parser. The properties belong only to the page where they are defined and are not stored in the DB.
We most definitely want the ability to store these values in the database, and also to query them. Queries would however be very restricted: no on-page queries, and any public interface would be limited to very simple queries. Some values would be integrated with the general search interface, in order to enable e.g. searching for images by license and geographic region.
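A minimal sketch of what "store and run very simple queries" could look like, using Python's built-in sqlite3 module. The table name `semantic_props` and both function names are hypothetical; this is not an existing MediaWiki or SMW schema, just one row per (page, property, value).

```python
import sqlite3


def build_store(triples):
    """Load (page, property, value) triples into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE semantic_props (page TEXT, prop TEXT, value TEXT)")
    conn.executemany("INSERT INTO semantic_props VALUES (?, ?, ?)", triples)
    return conn


def pages_with(conn, prop, value):
    """The only query offered: exact match on one property, e.g. images by license."""
    rows = conn.execute(
        "SELECT page FROM semantic_props WHERE prop = ? AND value = ? ORDER BY page",
        (prop, value),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = build_store([
        ("File:A.jpg", "License", "CC-BY-SA-3.0"),
        ("File:B.jpg", "License", "GFDL"),
    ])
    print(pages_with(store, "License", "CC-BY-SA-3.0"))
```

Restricting the public interface to exact-match lookups like this keeps the query cost predictable, which is the point of ruling out on-page queries.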
-- daniel
Dmitriy Sintsov schrieb:
* Aryeh Gregor Simetrical+wikilist@gmail.com [Mon, 18 Jan 2010 17:36:23 -0500]:
What would be sample SMW markup for the image example from my first post here?
Something like this: [[Property_name::Property_value]]. I think the "third" (actually the first) value in the triple will be the page where the property is defined. Also, the output of the property value can be piped, much as with links.
The "third" value (the subject) would by *default* be the *subject* of the page the property is defined on, not the page itself. This is an extremely important distinction. Being able to specify the subject explicitly would also be useful.
-- daniel
On Tue, 19 Jan 2010 10:40:16 +0300, Dmitriy Sintsov wrote:
* Aryeh Gregor Simetrical+wikilist@gmail.com [Mon, 18 Jan 2010 17:36:23 -0500]:
What would be sample SMW markup for the image example from my first post here?
Something like this: [[Property_name::Property_value]]. I think the "third" (actually the first) value in the triple will be the page where the property is defined. Also, the output of the property value can be piped, much as with links.
[[work::http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg]] [[title::Emery Molyneux Terrestrial Globe]] [[author::Bob Smith]] [[license::http://creativecommons.org/licenses/by-sa/3.0/us/]]
The pages with the corresponding property names (work, author, etc.) in the NS_PROPERTY namespace will define the semantic types of the data.
Someone stated that square-bracket syntax is harder to parse, favoring magic words / functions. Perhaps such definitions can be replaced by semantic magic words {{PROP|Name|Value}}, implemented for the Parser. Then SMW itself can pick up that curly-brace syntax. The advantage is that dumps of Wikipedia will be suitable for local SMW querying, even if SMW is not installed on Wikimedia servers.
IIRC, the main stumbling block for SMW is that it's too monolithic, so a project can't push to get the parts of it enabled that they need, since it's pretty much all or nothing. I think a syntax that's more consistent with the rest of MediaWiki would simplify that.
The most obvious way to do that, particularly if you only want to create metadata without rendering links, would be just to use the normal parser function syntax {{#function:name=value|...}} i.e.
{{#smw: work=http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg | title=Emery Molyneux Terrestrial Globe | author = Bob Smith | license=http://creativecommons.org/licenses/by-sa/3.0/us/ }}
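To show how the name=value form could be read, here is a minimal Python sketch. Note that {{#smw:...}} is only the syntax proposed in this message, not an existing parser function, and `parse_smw_call` is a hypothetical name. The sketch splits naively on "|" and so does not handle nested templates or links inside values.

```python
def parse_smw_call(call):
    """Parse the proposed '{{#smw: k=v | k=v }}' form into a dict (hypothetical syntax)."""
    body = call.strip()
    if not (body.startswith("{{#smw:") and body.endswith("}}")):
        raise ValueError("not an #smw parser-function call")
    body = body[len("{{#smw:"):-2]
    params = {}
    for part in body.split("|"):
        # Split on the first "=" only, so values may contain "=" themselves.
        name, sep, value = part.partition("=")
        if sep:  # skip empty or malformed segments
            params[name.strip()] = value.strip()
    return params


if __name__ == "__main__":
    print(parse_smw_call("{{#smw: title=Emery Molyneux Terrestrial Globe | author = Bob Smith }}"))
```

Whitespace around names and values is stripped, so "author = Bob Smith" and "author=Bob Smith" are treated the same.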
If you do want to render links, maybe something akin to the current image linking syntax would be useful, so at least some parsing code could be shared; i.e.
[[smw:work|http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg]] written by [[smw:author|Bob Smith]]
I can see two major problems with using SMW syntax for input:
1) Implementing microdata-output-as-microdata is trivial, and already implemented in core. SMW syntax would still need to be implemented in core, and this would probably be nontrivial.
Just a minimal usable subset of syntax, without the semantic queries, "concepts" and other harder stuff.
2) We typically set a much higher bar for new wikisyntax than for new whitelisted HTML attributes. By our regular standards, I don't see how it would be justifiable to introduce new syntax for the sake of probably a tiny number of templates.
Semantic data is potentially everywhere in Wikipedia pages.
Maybe your syntax would be better here, but I don't really see the benefits except to those who are already using SMW.
The MW Parser may introduce a better-integrated syntax with magic words, which can later be adapted by SMW for its extended querying abilities. Just an idea.
Dmitriy
On Mon, Jan 18, 2010 at 17:57, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
AFAIK, Microdata is slightly less expressive than RDFa (it can't express cycles or something like that -- maybe someone else could clarify?)
It was Toby Inkster who pointed this out on public-html recently. [1] Here's my take on it: Since the microdata model is tree-like it can be mapped onto the graph model that is RDF, but the reverse isn't as trivial because of some quirks in RDF (XML Schema Datatypes and blank nodes, so far).
A typical RDF triple can be expressed as such in microdata:
<div itemscope itemid="http://example.com/subject">
<link itemprop="http://example.com/predicate" href="http://example.com/object">
</div>
This isn't very readable and only something you would do if your goal is to translate existing RDF triples to HTML verbatim, mechanically (which seems like a strange thing to do given how much more readable e.g. Turtle is). You can create full RDF graphs simply by adding lots of triples. As you see, each node is identified by a URI: http://example.com/subject and http://example.com/object above. However, RDF also has "blank nodes" which aren't identified by anything at all. Microdata only creates these implicitly when converting to RDF, which means you can't deliberately "merge" two blank nodes.
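The blank-node point can be illustrated with a small Python sketch that flattens a microdata-like item tree into triples. This is a simplification, not the microdata-to-RDF conversion algorithm from any spec: the nested-dict representation and the name `item_to_triples` are assumptions, and URI objects and literals are not distinguished.

```python
from itertools import count


def item_to_triples(item, triples=None, counter=None):
    """Flatten an item tree into (subject, predicate, object) triples.

    Items without an "id" get a fresh blank-node label, so two such items
    can never be made to refer to the same node.
    """
    if triples is None:
        triples = []
    if counter is None:
        counter = count()
    subject = item.get("id") or f"_:b{next(counter)}"
    for predicate, values in item.get("props", {}).items():
        for value in values:
            if isinstance(value, dict):  # nested item: recurse, link by its subject
                obj, _ = item_to_triples(value, triples, counter)
            else:
                obj = value
            triples.append((subject, predicate, obj))
    return subject, triples


if __name__ == "__main__":
    item = {
        "id": "http://example.com/subject",
        "props": {"http://example.com/predicate": ["http://example.com/object"]},
    }
    print(item_to_triples(item)[1])
```

Because the blank-node labels are generated during conversion, the author of the markup has no handle with which to say that two anonymous items are the same node.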
Personally I doubt that this is going to be a problem. However, I'm quite certain that if a use case pops up which really does require finer control over blank nodes, then we can hardwire the _:blank URI scheme to do just that.
[1] http://lists.w3.org/Archives/Public/public-html/2010Jan/0794.html
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e21001180857x24bac57fp824c019956143d59@mail.gmail.com...
On Mon, Jan 18, 2010 at 7:47 AM, Henri Sivonen hsivonen@iki.fi wrote:
1) Output a very few pieces of metadata that would be useful to HTML consumers, like license metadata. For these, we should use microdata or RDFa, maybe just with one or two vocabularies whitelisted, and it would be simplest to just let people type it into templates via wikitext. I'm pretty certain about this.
Eh? I get the feeling that we're reading from totally different song sheets here. What you seem to be saying here is that you expect the use case to be 'license templates on steroids': on the image description page, we have license templates that now emit microdata/RDF/the-metadata-format-of-the-month, which can be picked up by whoever is interested. That's not MediaWiki doing anything active with the data, and it's absolutely no different from marking up infoboxes. In fact, the usecase for infoboxes is arguably stronger, because their data structure is more complicated and harder to machine-read otherwise.
What I had assumed we meant by "MediaWiki do stuff with metadata" would be to pick up metadata about an image, and then output that **wherever the image is used**. So when you view an article with an image, that use of the image has a metadata cloud that describes where the image is from, what its license is, whatever. Information that, for an external image, might not be available via JavaScript or other means. I see things like the "put-a-red-border-round-fair-use-images" script I have in my monobook being implemented just by picking out that metadata, and without having to run stacks of api queries.
That usecase is incredibly badly served by just allowing raw metadata in the image page wikitext; it's really no different to adding categories via a license template. MediaWiki needs to have that metadata stored separately from wikitext, or at least entered via wikitext in a parser-friendly way: the customary way for the parser to pick 'stuff' out of wikitext is with parser functions, magic words, link syntax, whatever.
We can always add new input formats or switch the output format later if we have good reason, though. Especially if we keep input restricted to one or two vocabularies -- or three, which for microdata is all of them right now. :)
Again, I don't know which side of the coin you're talking about: switching the output format is trivial *iff* there's a disjoint between the input and output. If MW is extracting its metadata by reading [format] out of wikitext, then *adding* new formats becomes a PITA, and *removing* formats becomes impossible. So much better to have a format-independent input system for extracting metadata, and then be able to implement any of a range of outputs as dictated by the times.
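A minimal Python sketch of that separation: one format-neutral record, with each output format as a separate renderer. The function names, the example.org vocabulary URIs and the flat dict record are all hypothetical, and the RDFa renderer omits a subject (`about`) for brevity.

```python
from html import escape


def to_microdata(props, itemtype):
    """Render a flat property dict as a microdata item."""
    spans = "".join(
        f'<span itemprop="{escape(k)}">{escape(v)}</span>' for k, v in props.items()
    )
    return f'<div itemscope itemtype="{escape(itemtype)}">{spans}</div>'


def to_rdfa(props, prefix, namespace):
    """Render the same dict as RDFa, declaring the CURIE prefix on the wrapper."""
    spans = "".join(
        f'<span property="{escape(prefix)}:{escape(k)}">{escape(v)}</span>'
        for k, v in props.items()
    )
    return f'<div xmlns:{escape(prefix)}="{escape(namespace)}">{spans}</div>'


if __name__ == "__main__":
    # Format-neutral record, e.g. as extracted from [[Title::...]] annotations.
    record = {"title": "Emery Molyneux Terrestrial Globe", "author": "Bob Smith"}
    print(to_microdata(record, "http://example.org/work"))
    print(to_rdfa(record, "ex", "http://example.org/ns#"))
```

Adding or dropping an output format then means adding or dropping one renderer, without touching what editors typed into the wikitext.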
--HM
On Mon, Jan 18, 2010 at 6:41 PM, Happy-melon happy-melon@live.com wrote:
Eh? I get the feeling that we're reading from totally different song sheets here. What you seem to be saying here is that you expect the use case to be 'license templates on steroids': on the image description page, we have license templates that now emit microdata/RDF/the-metadata-format-of-the-month, which can be picked up by whoever is interested.
Right. We know that web spiders are interested in picking up this metadata automatically.
That's not MediaWiki doing anything active with the data, and it's absolutely no different from marking up infoboxes. In fact, the usecase for infoboxes is arguably stronger, because their data structure is more complicated and harder to machine-read otherwise.
I'm not clear what your analogy to infoboxes is about.
What I had assumed we meant by "MediaWiki do stuff with metadata" would be to pick up metadata about an image, and then output that **wherever the image is used**. So when you view an article with an image, that use of the image has a metadata cloud that describes where the image is from, what its license is, whatever.
Ah, I see. I don't think we want to do that. There's no end to the amount of metadata you could shove into a page in machine-readable format -- we'd be talking serious markup bloat here if you start adding things on the basis of "someone will surely find it useful". I wouldn't want to add any extra output on every page unless we had a known, concrete use for it.
That usecase is incredibly badly served by just allowing raw metadata in the image page wikitext; it's really no different to adding categories via a license template.
It's no different, except that RDFa/microdata are relatively standard, so third parties don't have to special-case MediaWiki and can use the same code to figure out licenses on all sites. That's the only advantage.
Again, I don't know which side of the coin you're talking about: switching the output format is trivial *iff* there's a disjoint between the input and output.
Well, the idea is you could accept microdata as input, and transform it into a different format for output if in the future you decided you didn't like microdata. So you could add the disjointness between input and output at a later date if it's needed then.
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e21001181612t84d5c90kc16ccc8724ca5b72@mail.gmail.com...
On Mon, Jan 18, 2010 at 6:41 PM, Happy-melon happy-melon@live.com wrote:
Eh? I get the feeling that we're reading from totally different song sheets here. What you seem to be saying here is that you expect the use case to be 'license templates on steroids': on the image description page, we have license templates that now emit microdata/RDF/the-metadata-format-of-the-month, which can be picked up by whoever is interested.
Right. We know that web spiders are interested in picking up this metadata automatically.
That's not MediaWiki doing anything active with the data, and it's absolutely no different from marking up infoboxes. In fact, the usecase for infoboxes is arguably stronger, because their data structure is more complicated and harder to machine-read otherwise.
I'm not clear what your analogy to infoboxes is about.
I was saying that license templates are significantly easier to machine-read than infoboxes, because their data is simpler. The ultimate goal is, as you say, to allow machine reading without bespoke parsing, but that's a long way down the line.
What I had assumed we meant by "MediaWiki do stuff with metadata" would be to pick up metadata about an image, and then output that **wherever the image is used**. So when you view an article with an image, that use of the image has a metadata cloud that describes where the image is from, what its license is, whatever.
Ah, I see. I don't think we want to do that. There's no end to the amount of metadata you could shove into a page in machine-readable format -- we'd be talking serious markup bloat here if you start adding things on the basis of "someone will surely find it useful". I wouldn't want to add any extra output on every page unless we had a known, concrete use for it.
At least we now *know* we're talking about different things :-D I agree there are gradations of what is 'worth' putting into the markup; although "adding things on the basis of 'someone will surely find it useful'" is **exactly** what we will get if we allow the busy bee template developers access to a metadata markup, almost by definition. I would say it's definitely 'worth' exposing license metadata on every use of an image; the status of a page's images affects our whole terms of use, whether we can say "yes you can use all this in this fashion" versus "you have to jump through these hoops for these images because they're different". Author, location, capture date; yes these probably aren't 'worth' the cost of exposing on pages. But being able to search commons for all photos taken in Berlin between 1989 and 1991 would be worth its weight in gold.
That usecase is incredibly badly served by just allowing raw metadata in the image page wikitext; it's really no different to adding categories via a license template.
It's no different, except that RDFa/microdata are relatively standard, so third parties don't have to special-case MediaWiki and can use the same code to figure out licenses on all sites. That's the only advantage.
...
Well, the idea is you could accept microdata as input, and transform it into a different format for output if in the future you decided you didn't like microdata. So you could add the disjointness between input and output at a later date if it's needed then.
Indeed, but that's data *output*, not input. Currently our categories are input via [[Category:Foo]] and output via some HTML at the bottom of the page, but also via the API in a variety of formats; people use both methods to extract the metadata. Once MW knows what data an object has, how it outputs that data back is totally open as you say. So given that a translation into a format that MW understands is desirable for its own sake, and that from there it's trivial to translate back into whatever output format(s) the current web demands, why would we choose an input format like
<span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>
Rather than an input format like [[License::CC-BY-SA-3.0]]??
--HM
On Mon, Jan 18, 2010 at 7:34 PM, Happy-melon happy-melon@live.com wrote:
I was saying that license templates are significantly easier to machine-read than infoboxes, because their data is simpler. The ultimate goal is, as you say, to allow machine reading without bespoke parsing, but that's a long way down the line.
No it's not. Google already does it for RDFa and microformats. Any major user of microdata would encourage them to support that too (especially since they invented it). Multiple browsers have also announced interest in supporting microdata.
At least we now *know* we're talking about different things :-D
Yep. :P
I agree there are gradations of what is 'worth' putting into the markup; although "adding things on the basis of 'someone will surely find it useful'" is **exactly** what we will get if we allow the busy bee template developers access to a metadata markup, almost by definition.
I bet very few people would bother adding metadata without a concrete use. And they'd probably get into fights with other people annoyed at them for making it harder to edit wikitext. This would all be irrelevant if we only supported a few whitelisted vocabularies, though, as the current microdata implementation does. We should encourage bulky and not-so-useful stuff to go in a separate stream.
I would say it's definitely 'worth' exposing license metadata on every use of an image; the status of a page's images affects our whole terms of use, whether we can say "yes you can use all this in this fashion" versus "you have to jump through these hoops for these images because they're different". Author, location, capture date; yes these probably aren't 'worth' the cost of exposing on pages. But being able to search Commons for all photos taken in Berlin between 1989 and 1991 would be worth its weight in gold.
Sure -- but that can be exposed in a separate data stream, since 99.9% of page views won't need it.
Indeed, but that's data *output*, not input. Currently our categories are input via [[Category:Foo]] and output via some HTML at the bottom of the page, but also via the API in a variety of formats; people use both methods to extract the metadata. Once MW knows what data an object has, how it outputs that data back is totally open as you say. So given that a translation into a format that MW understands is desirable for its own sake, and that from there it's trivial to translate back into whatever output format(s) the current web demands, why would we choose an input format like
<span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>
Rather than an input format like [[License::CC-BY-SA-3.0]]??
First, why are you asking me why we would choose RDFa when I don't think we should? At least quote microdata.
Second, this is apples to oranges. Your RDFa sample a) says that the work is a still image, b) gives its name, c) gives the author's name, d) gives the URL of the license, e) contains user-visible prose. Your wikitext sample just gives the license name (not even a license URL!). No kidding the latter is shorter. A more realistic comparison might be
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
vs.
<p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]] by [[author::Bob Smith|]] is licensed under a [[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creativecommons.org/licenses/by-sa/3.0/us/ Creative Commons Attribution-Share Alike 3.0 United States License]]].</p>
or something, which is not such an easy call. The wikitext is not that much shorter or simpler -- particularly when you account for the fact that you'd have to separately define mappings to concrete microdata/RDFa/RDF vocabularies for output. (Yes, I left out the itemtype on the microdata, but again, that would have to be defined somewhere for the wikisyntax too.)
On Mon, Jan 18, 2010 at 7:47 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Looks like I've had my hand slapped twice during this discussion. I thought this was the first warning, but David seems to think differently. That means that either I've been too aggressive or David is not familiar with the level of intensity surrounding the Microdata/RDFa debates.
That veiled insults and questioning others' motives is par for the course on public-html doesn't mean we're going to tolerate it here. It shouldn't happen there either, of course, but we can't help that.
I strongly disagree with the idea of getting Microdata integrated with Wikipedia at this stage, before REC
This is just not a reasonable position to take outside the ivory tower of standards-making. We are not going to deny our users useful features just because some spec somewhere that happens to describe the feature is not absolutely 100% fully finished. We use zillions of features that aren't in any spec at all, or are only in Working Draft, as do all authors. Do you really think we shouldn't be using CSS3 Selectors or CSS2.1 until they're REC? Should we only use a Java video player even when multiple browsers support a much better *and* more standards-compliant experience via <video>, just because HTML5 is still a WD?
This is just not tenable. We use features when they're useful, not when someone else thinks we should use them. Our goal is to serve our users, not spec writers. Users above authors above implementers above specification writers . . .
On Tue, Jan 19, 2010 at 2:40 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
[[work::http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg]] [[title::Emery Molyneux Terrestrial Globe]] [[author::Bob Smith]] [[license::http://creativecommons.org/licenses/by-sa/3.0/us/]]
We could use this, but I don't see a big advantage over raw microdata if a) we'll be outputting as microdata at first anyway, and b) it's only expected to be used for a very few things like licenses, presumably hidden away behind templates. If it is done, though, it should be with curly braces for sanity's sake: {{#prop:author|Bob Smith}} or whatnot.
This sort of thing might be good syntax for a separate RDF stream, but I think we can keep that simpler. Instead of having {{Infobox foo|name=Bob Smith}} contain, somewhere, {{#prop:name|{{{name}}}}}, creating the triple (page name, 'name', 'Bob Smith') for the page, why not just leave out the #prop and have *every* template parameter create a triple? So {{foo|bar=baz|quuz=quuuz}} would create the triples (page name, 'foo|bar', 'baz'), (page name, 'foo|quuz', 'quuuz') with no extra markup needed. The triples could then be transformed into a more useful form by the consumer, using a language like OWL. This is something like how dbpedia.org works right now, AFAICT.
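The "every template parameter creates a triple" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `template_triples` and the page name used below are hypothetical, and nothing here reflects actual MediaWiki or DBpedia code.

```python
def template_triples(page_name, template_name, params):
    """Return one (subject, predicate, object) triple per template parameter.

    The predicate is the template name joined to the parameter name with '|',
    mirroring the 'foo|bar' form used in the message above.
    """
    return [
        (page_name, f"{template_name}|{key}", value)
        for key, value in params.items()
    ]


# {{foo|bar=baz|quuz=quuuz}} on a hypothetical page called "Some page"
triples = template_triples("Some page", "foo", {"bar": "baz", "quuz": "quuuz"})
for triple in triples:
    print(triple)
```

A consumer would still need a separate mapping from predicates such as 'foo|bar' to a shared vocabulary, which is the step the message leaves to a language like OWL.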
We could use this, but I don't see a big advantage over raw microdata if a) we'll be outputting as microdata at first anyway, and b) it's only expected to be used for a very few things like licenses, presumably hidden away behind templates. If it is done, though, it should be with curly braces for sanity's sake: {{#prop:author|Bob Smith}} or whatnot.
Who's to say it would only be used for licenses? I understand that this is the immediate want for microdata (or RDFa), but that want could expand at any time, and could expand greatly. Also, Wikimedia may only want to use it for licensing, but third party sites may choose to use it for far more than that.
Why shouldn't we use a technology-neutral input format? What happens if microdata is replaced by something better/easier/simpler? I also don't necessarily think we should lock users into a certain technology. If we choose a neutral input format, users can decide which output they wish to use (via extensions).
This sort of thing might be good syntax for a separate RDF stream, but I think we can keep that simpler. Instead of having {{Infobox foo|name=Bob Smith}} contain, somewhere, {{#prop:name|{{{name}}}}}, creating the triple (page name, 'name', 'Bob Smith') for the page, why not just leave out the #prop and have *every* template parameter create a triple? So {{foo|bar=baz|quuz=quuuz}} would create the triples (page name, 'foo|bar', 'baz'), (page name, 'foo|quuz', 'quuuz') with no extra markup needed. The triples could then be transformed into a more useful form by the consumer, using a language like OWL. This is something like how dbpedia.org works right now, AFAICT.
It is nice to be able to define an ontology separate from template parameters. In this scheme, every ontology that shares properties has to define the same template parameters. I think there are likely a lot of Wikipedia templates that share common properties, but do not use common template parameters. I for sure know my local wiki has templates with different parameters that use the same properties.
This also means that templates would create dependencies on one another, since parameters need to stay the same to keep the same properties; this would be hell on those who maintain templates.
Respectfully,
Ryan Lane
On Wed, Jan 20, 2010 at 10:02 AM, Lane, Ryan Ryan.Lane@ocean.navo.navy.mil wrote:
Why shouldn't we use a technology-neutral input format? What happens if microdata is replaced by something better/easier/simpler? I also don't necessarily think we should lock users into a certain technology. If we choose a neutral input format, users can decide which output they wish to use (via extensions).
Doesn't this same argument apply to using any new HTML feature?
On Wed, Jan 20, 2010 at 11:47 AM, Happy-melon happy-melon@live.com wrote:
You could say that we're talking about different things again; that you're talking about marking up data for external use. But there's no reason why a {{#prop:foo|bar}} magic word can't *also* output some appropriate metadata format into the wikitext. Marking up in a format-neutral syntax allows us to output metadata from wikitext *and* from MW generally, and to change *both* formats at the drop of a hat. Marking up in a particular format, whatever the format is, makes it damn near impossible (or at least hopelessly hackish) to change wikitext output from one format to another, and equally horrible for MW to collect data at all.
Okay, I'll grant that for an RDF-style use-case, parser functions are a better bet than the alternatives. However, I'm not sure that's the case for inline markup, in the limited cases where we want that (e.g., image licenses). The problem here is that you'd have to associate the metadata with particular phrases. You can't say {{#prop:license|CC-BY-SA-2.0}} and output that as proper microdata/RDFa -- or rather you could, but only by creating empty content nodes someplace. I guess that would work . . . it's not good practice if you're hand-authoring, and it would take a bit more space, but it might indeed make sense from our POV.
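As a rough illustration of the "empty content node" approach, the sketch below renders one format-neutral property as either microdata or RDFa. It assumes the value has already been resolved to a URL, and the function name `render_prop` and the exact elements chosen (`<link>` for microdata, an empty `<span>` with `resource` for RDFa) are assumptions made for the example, not existing MediaWiki behaviour.

```python
from html import escape


def render_prop(name, url, fmt="microdata"):
    """Render a URL-valued property as an empty element that carries only metadata."""
    name_attr = escape(name, quote=True)
    url_attr = escape(url, quote=True)
    if fmt == "microdata":
        # Empty element: no visible text is associated with the property.
        return f'<link itemprop="{name_attr}" href="{url_attr}">'
    if fmt == "rdfa":
        return f'<span rel="{name_attr}" resource="{url_attr}"></span>'
    raise ValueError(f"unknown output format: {fmt}")


license_url = "http://creativecommons.org/licenses/by-sa/3.0/us/"
print(render_prop("license", license_url, "microdata"))
print(render_prop("license", license_url, "rdfa"))
```

Switching the site-wide output format then means changing one argument rather than rewriting the wikitext, at the cost of the extra empty nodes mentioned above.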
But then there's the question of writing it. The code for raw microdata/RDFa output is already written, and is pretty trivial besides. Is anyone willing to write core code to do this metadata abstraction with a parser function, and output in appropriate formats? If not, the choice is microdata, RDFa, or nothing.
On Wed, Jan 20, 2010 at 1:38 PM, Conrad Irwin conrad.irwin@googlemail.com wrote:
I do not like the idea of having a parser function that outputs the data into the article - if people want the meta-data they can query it from an API, or a dump, as opposed to screen-scraping. Perhaps meta-data on image pages is useful, but if someone wants to get licenses of all the images, surely providing a single file containing all is better than screen-scraping for it
Not for search engines. They're spidering all the pages anyway, so it's easier for them to not retrieve a separate page. Besides, how would they know how to find the metadata if it's not included or pointed to on the page in some standard format?
On Wed, Jan 20, 2010 at 7:10 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Aryeh, you're quoting something that I purposefully said off-list in an attempt to save this mailing list from the RDFa/Microdata tumult.
Oops! I'm *really* sorry. I didn't notice it was an off-list reply, so I copy-pasted it into my on-list reply to Happy-melon. Gmail threads off-list replies in the same conversation and doesn't provide any obvious visual cues. So I honestly thought that was an on-list reply. Sorry about that! I think it was a very nice and thoughtful reply overall, and hope I didn't do too much damage by (partially) publicizing it.
I will be responding shortly to the remaining questions that have been unanswered during this discussion and then leaving the discussion entirely. I don't feel that we are having a productive discussion here and the damage that I fear is resulting is the rejection of both Microdata and RDFa.
Based on current discussion, I think we'll end up going with one or the other for image licenses, probably with a toggle to use whichever you prefer. If someone writes the code to do that, which is a significant "if".
* Aryeh Gregor Simetrical+wikilist@gmail.com [Thu, 21 Jan 2010 18:55:38 -0500]:
Okay, I'll grant that for an RDF-style use-case, parser functions are a better bet than the alternatives. However, I'm not sure that's the case for inline markup, in the limited cases where we want that (e.g., image licenses). The problem here is that you'd have to associate the metadata with particular phrases. You can't say {{#prop:license|CC-BY-SA-2.0}} and output that as proper microdata/RDFa -- or rather you could, but only by creating empty content nodes someplace. I guess that would work . . . it's not good practice if you're hand-authoring, and it would take a bit more space, but it might indeed make sense from our POV.
To output the magic word / parser function as properly defined metadata, one has to define the type of "license" (that is, associate it with a "vocabulary" that has a proper xmlns) on its definition page in the NS_PROPERTY namespace, at [[Property:license]]. The Property:license page would then hold a mapping from the "CC-BY-SA-2.0" string to its "expanded xmlized value".
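A minimal sketch of that lookup, assuming the contents of a Property: page could be represented as a dictionary. The structure `PROPERTY_PAGES` and the function `expand_property` are hypothetical names used only for illustration; the vocabulary URL is the cc namespace quoted earlier in the thread.

```python
# Hypothetical stand-in for what a [[Property:license]] page might define.
PROPERTY_PAGES = {
    "license": {
        "vocabulary": "http://creativecommons.org/ns#",
        "values": {
            "CC-BY-SA-2.0": "http://creativecommons.org/licenses/by-sa/2.0/",
        },
    },
}


def expand_property(name, short_value):
    """Map a short wikitext value to a (predicate URI, expanded value) pair."""
    page = PROPERTY_PAGES[name]
    predicate = page["vocabulary"] + name
    # Values without a mapping are passed through unchanged.
    expanded = page["values"].get(short_value, short_value)
    return predicate, expanded


print(expand_property("license", "CC-BY-SA-2.0"))
```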
But then there's the question of writing it. The code for raw microdata/RDFa output is already written, and is pretty trivial besides. Is anyone willing to write core code to do this metadata abstraction with a parser function, and output in appropriate formats? If not, the choice is microdata, RDFa, or nothing.
Experts in the Parser could probably do that fast enough; however, they might be busy with more important jobs.
Not for search engines. They're spidering all the pages anyway, so it's easier for them to not retrieve a separate page. Besides, how would they know how to find the metadata if it's not included or pointed to on the page in some standard format?
Duesentrieb pointed out that MediaWiki built-in search can benefit from that, as well. Both metadata generation for the external engines and internal storage (similar to SMW) could be implemented. However, it's much simpler to perform metadata generation without the storage backend (less optimization, scalability issues, code review). This way the goal can be achieved in incremental steps. SMW can be adapted to use such syntax, then it can be used to retrieve the data from Wikipedia dumps, by installing SMW on an external, less-trusted server (toolserver?), where the code review is not so critically important.
Dmitriy
On Wed, Jan 20, 2010 at 2:38 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
No kidding the latter is shorter. A more realistic comparison might be
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
vs.
<p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]] by [[author::Bob Smith|]] is licensed under a [[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creativecommons.org/licenses/by-sa/3.0/us/ Creative Commons Attribution-Share Alike 3.0 United States License]]].</p>
Hmm... If <span itemprop> and <span href> were (are?) allowed tag/attribute combinations in MediaWiki, we could just do
{{RDFa|title|EmeryMolyneux-terrestrialglobe-1592-20061127.jpg}} by {{RDFa|author|Bob Smith}} is licensed under a {{RDFa|license|Creative Commons Attribution-Share Alike 3.0 United States License|http://creativecommons.org/licenses/by-sa/3.0/us/}}.
with {{RDFa}} like <span itemprop="{{{1}}}" href="{{{3}}}">{{{2}}}</span>
(add clever {{#if}} for the href part)
Magnus
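The expansion Magnus describes, including the conditional href, can be mimicked outside MediaWiki as follows. This is a sketch only: `rdfa_template` is a hypothetical stand-in for the proposed {{RDFa}} template, and it reproduces the markup from the message as written without checking whether `href` on a span is valid HTML.

```python
from html import escape


def rdfa_template(prop, text, href=None):
    """Mimic {{RDFa|prop|text|href}}: the href attribute is emitted only if given."""
    attrs = f'itemprop="{escape(prop, quote=True)}"'
    if href:  # plays the role of the {{#if}} around the third parameter
        attrs += f' href="{escape(href, quote=True)}"'
    return f"<span {attrs}>{escape(text)}</span>"


print(rdfa_template("author", "Bob Smith"))
print(rdfa_template(
    "license",
    "Creative Commons Attribution-Share Alike 3.0 United States License",
    "http://creativecommons.org/licenses/by-sa/3.0/us/",
))
```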
I bet very few people would bother adding metadata without a concrete use.
This whole discussion about meta-data sounds eerily familiar...
From: http://en.wikipedia.org/wiki/SGML
"As a document markup language, SGML was originally designed to enable the sharing of machine-readable large-project documents in government, law, and industry. Many of these documents must remain readable for several decades — a long time in the information technology field. SGML also was extensively applied by the military, and the aerospace, technical reference, and industrial publishing businesses. The advent of the XML profile has made SGML suitable for widespread application for small-scale, general-purpose use."
So now XML isn't light enough... :-)
Mark W.
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e21001200638y759365c8oeecd8f06f761a583@mail.gmail.com...
On Mon, Jan 18, 2010 at 7:34 PM, Happy-melon happy-melon@live.com wrote:
I bet very few people would bother adding metadata without a concrete use. And they'd probably get into fights with other people annoyed at them for making it harder to edit wikitext. This would all be irrelevant if we only supported a few whitelisted vocabularies, though, as the current microdata implementation does. We should encourage bulky and not-so-useful stuff to go in a separate stream.
Yes, very few people would bother. Those few people would still introduce a monstrous amount of extra markup by working deep in the template stack. Doesn't take much to add kilobytes to large articles; I've added 5kb to [[Barack Obama]] myself just by adding a span round reference brackets. Just adding author metadata to citation templates would add seconds to load times for large articles.
I would say it's definitely 'worth' exposing license metadata on every use of an image; the status of a page's images affects our whole terms of use, whether we can say "yes you can use all this in this fashion" versus "you have to jump through these hoops for these images because they're different". Author, location, capture date; yes these probably aren't 'worth' the cost of exposing on pages. But being able to search Commons for all photos taken in Berlin between 1989 and 1991 would be worth its weight in gold.
Sure -- but that can be exposed in a separate data stream, since 99.9% of page views won't need it.
I'm not talking about exposing it in a data stream per se, I'm suggesting that that's what our internal search would be able to achieve if the metadata was accessible to MediaWiki.
Indeed, but that's data *output*, not input. Currently our categories are input via [[Category:Foo]] and output via some HTML at the bottom of the page, but also via the API in a variety of formats; people use both methods to extract the metadata. Once MW knows what data an object has, how it outputs that data back is totally open as you say. So given that a translation into a format that MW understands is desirable for its own sake, and that from there it's trivial to translate back into whatever output format(s) the current web demands, why would we choose an input format like
<span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>
Rather than an input format like [[License::CC-BY-SA-3.0]]??
First, why are you asking me why we would choose RDFa when I don't think we should? At least quote microdata.
Second, this is apples to oranges. Your RDFa sample a) says that the work is a still image, b) gives its name, c) gives the author's name, d) gives the URL of the license, e) contains user-visible prose. Your wikitext sample just gives the license name (not even a license URL!). No kidding the latter is shorter. A more realistic comparison might be
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
vs.
<p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]] by [[author::Bob Smith|]] is licensed under a [[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creativecommons.org/licenses/by-sa/3.0/us/ Creative Commons Attribution-Share Alike 3.0 United States License]]].</p>
or something, which is not such an easy call. The wikitext is not that much shorter or simpler -- particularly when you account for the fact that you'd have to separately define mappings to concrete microdata/RDFa/RDF vocabularies for output. (Yes, I left out the itemtype on the microdata, but again, that would have to be defined somewhere for the wikisyntax too.)
True, the markup Dmitriy offered is more suitable. But Ryan is absolutely right. You're only thinking about the *current* generation of formats, and assuming (maybe legitimately, I don't know) that microdata is the best format for us to use. What happens when the next generation of format(s) comes out? With a format-neutral input format, MW sites can quickly adapt to accommodate it. Plus this method of data-injection will take much more work to allow MW to extract the data from the wikitext, which puts our searching for photos in Berlin issue further out of reach.
You could say that we're talking about different things again; that you're talking about marking up data for external use. But there's no reason why a {{#prop:foo|bar}} magic word can't *also* output some appropriate metadata format into the wikitext. Marking up in a format-neutral syntax allows us to output metadata from wikitext *and* from MW generally, and to change *both* formats at the drop of a hat. Marking up in a particular format, whatever the format is, makes it damn near impossible (or at least hopelessly hackish) to change wikitext output from one format to another, and equally horrible for MW to collect data at all.
--HM
On 01/20/2010 04:47 PM, Happy-melon wrote:
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e21001200638y759365c8oeecd8f06f761a583@mail.gmail.com...
On Mon, Jan 18, 2010 at 7:34 PM, Happy-melon happy-melon@live.com wrote:
I bet very few people would bother adding metadata without a concrete use. And they'd probably get into fights with other people annoyed at them for making it harder to edit wikitext. This would all be irrelevant if we only supported a few whitelisted vocabularies, though, as the current microdata implementation does. We should encourage bulky and not-so-useful stuff to go in a separate stream.
Yes, very few people would bother. Those few people would still introduce a monstrous amount of extra markup by working deep in the template stack. Doesn't take much to add kilobytes to large articles; I've added 5kb to [[Barack Obama]] myself just by adding a span round reference brackets. Just adding author metadata to citation templates would add seconds to load times for large articles.
I would say it's definitely 'worth' exposing license metadata on every use of an image; the status of a page's images affects our whole terms of use, whether we can say "yes you can use all this in this fashion" versus "you have to jump through these hoops for these images because they're different". Author, location, capture date; yes these probably aren't 'worth' the cost of exposing on pages. But being able to search Commons for all photos taken in Berlin between 1989 and 1991 would be worth its weight in gold.
Sure -- but that can be exposed in a separate data stream, since 99.9% of page views won't need it.
I'm not talking about exposing it in a data stream per se, I'm suggesting that that's what our internal search would be able to achieve if the metadata was accessible to MediaWiki.
Indeed, but that's data *output*, not input. Currently our categories are input via [[Category:Foo]] and output via some HTML at the bottom of the page, but also via the API in a variety of formats; people use both methods to extract the metadata. Once MW knows what data an object has, how it outputs that data back is totally open as you say. So given that a translation into a format that MW understands is desirable for its own sake, and that from there it's trivial to translate back into whatever output format(s) the current web demands, why would we choose an input format like
<span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>
Rather than an input format like [[License::CC-BY-SA-3.0]]??
First, why are you asking me why we would choose RDFa when I don't think we should? At least quote microdata.
Second, this is apples to oranges. Your RDFa sample a) says that the work is a still image, b) gives its name, c) gives the author's name, d) gives the URL of the license, e) contains user-visible prose. Your wikitext sample just gives the license name (not even a license URL!). No kidding the latter is shorter. A more realistic comparison might be
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
vs.
<p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]] by [[author::Bob Smith|]] is licensed under a [[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creativecommons.org/licenses/by-sa/3.0/us/ Creative Commons Attribution-Share Alike 3.0 United States License]]].</p>
or something, which is not such an easy call. The wikitext is not that much shorter or simpler -- particularly when you account for the fact that you'd have to separately define mappings to concrete microdata/RDFa/RDF vocabularies for output. (Yes, I left out the itemtype on the microdata, but again, that would have to be defined somewhere for the wikisyntax too.)
True, the markup Dmitriy offered is more suitable. But Ryan is absolutely right. You're only thinking about the *current* generation of formats, and assuming (maybe legitimately, I don't know) that microdata is the best format for us to use. What happens when the next generation of format(s) comes out? With a format-neutral input format, MW sites can quickly adapt to accommodate it. Plus this method of data-injection will take much more work to allow MW to extract the data from the wikitext, which puts our searching for photos in Berlin issue further out of reach.
You could say that we're talking about different things again; that you're talking about marking up data for external use. But there's no reason why a {{#prop:foo|bar}} magic word can't *also* output some appropriate metadata format into the wikitext. Marking up in a format-neutral syntax allows us to output metadata from wikitext *and* from MW generally, and to change *both* formats at the drop of a hat. Marking up in a particular format, whatever the format is, makes it damn near impossible (or at least hopelessly hackish) to change wikitext output from one format to another, and equally horrible for MW to collect data at all.
I do not like the idea of having a parser function that outputs the data into the article - if people want the meta-data they can query it from an API, or a dump, as opposed to screen-scraping. Perhaps meta-data on image pages is useful, but if someone wants to get licenses of all the images, surely providing a single file containing all is better than screen-scraping for it (even RDFa/microdata is screen scraping, in my opinion; it's just done with the hope that a developer has made it easy for you - you will still have to deal with invalid uses of markup, and the more complicated the markup, the more it will be used invalidly).
I would not be against whitelisting the necessary attributes to allow wikis to put in these formats manually.
I do like the idea (a lot) of having a parser function that can put data into a storage model inside MediaWiki (probably tabular, ideally relational) that can be dumped like the current articles or queried using the API. My original thoughts [0] had the wiki's technocrats define a few "tables" which could be populated with the {{#store}} command.
Conrad
[0] http://en.wiktionary.org/w/index.php?oldid=6304302
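To make the tabular storage idea concrete, here is a minimal sketch using an in-memory SQLite database. Everything in it is an assumption for illustration: the `photos` table, its columns, the file names, and the `store` function standing in for a {{#store}} call are all hypothetical, and no such parser function exists in the code discussed here.

```python
import sqlite3

# Tables predefined by the wiki's administrators: table name -> allowed columns.
SCHEMA = {"photos": ("city", "year")}


def store(conn, table, page, row):
    """Insert one row for a page, accepting only predefined tables and columns."""
    allowed = SCHEMA[table]
    if any(column not in allowed for column in row):
        raise ValueError("unknown column for table " + table)
    columns = ["page"] + list(row)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        [page] + list(row.values()),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (page TEXT, city TEXT, year INTEGER)")
store(conn, "photos", "File:Wall.jpg", {"city": "Berlin", "year": 1989})
store(conn, "photos", "File:Gate.jpg", {"city": "Berlin", "year": 1995})
store(conn, "photos", "File:Tower.jpg", {"city": "Paris", "year": 1990})

# The "photos taken in Berlin between 1989 and 1991" query from the thread.
berlin = conn.execute(
    "SELECT page FROM photos WHERE city = ? AND year BETWEEN ? AND ? ORDER BY page",
    ("Berlin", 1989, 1991),
).fetchall()
print(berlin)
```

Because the data sits in ordinary tables, the same rows could be dumped or served through an API without any screen-scraping of rendered pages.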
--HM
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Mon, Jan 18, 2010 at 7:47 PM, Manu Sporny msporny@digitalbazaar.com wrote:
Looks like I've had my hand slapped twice during this discussion. I thought this was the first warning, but David seems to think differently. That means that either I've been too aggressive or David is not familiar with the level of intensity surrounding the Microdata/RDFa debates.
That veiled insults and questioning others' motives is par for the course on public-html doesn't mean we're going to tolerate it here. It shouldn't happen there either, of course, but we can't help that.
I strongly disagree with the idea of getting Microdata integrated with Wikipedia at this stage, before REC
This is just not a reasonable position to take outside the ivory tower of standards-making.
Aryeh, you're quoting something that I purposefully said off-list in an attempt to save this mailing list from the RDFa/Microdata tumult.
It was an e-mail that I specifically sent to you, Henri, Philip and David Gerard. That e-mail, which contained a heart-felt apology (in case I had come across as too aggressive) has now been openly quoted out of context. I say this because that e-mail was specifically off-list and thus nobody on the list could discover the full context of the statements above. The off-list e-mail is attached below in order to further explain why I won't be participating in this discussion.
Here is the sequence of events that led us to this point:
* Aryeh Gregor posts a call to integrate Microdata into MediaWiki[1].
* Manu Sporny, noticing a number of factual errors, responds[2] to the initial e-mail. The arguments focus on the immaturity of both the Microdata and HTML5+RDFa proposals.
* A number of e-mails are exchanged, during which Manu Sporny is warned by David Gerard not to use the term FUD or other charged language.
* Manu Sporny, concerned that the debate is damaging both the Microdata and RDFa initiatives, decides to apologize and discuss the issue off-list with the respondents.
* Aryeh Gregor quotes portions of the off-list discussion, out of context, and responds[3] directly to the mailing list.
* Manu Sporny responds with this e-mail.
[1] http://lists.wikimedia.org/pipermail/wikitech-l/2010-January/046382.html
[2] http://lists.wikimedia.org/pipermail/wikitech-l/2010-January/046386.html
[3] http://lists.wikimedia.org/pipermail/wikitech-l/2010-January/046466.html
I will be responding shortly to the questions that remain unanswered in this discussion and then leaving the discussion entirely. I don't feel that we are having a productive discussion here, and the damage I fear will result is the rejection of both Microdata and RDFa.
If there are any further questions related to RDFa, please ask them on the RDFa Task Force mailing list (a public mailing list) and we will do the best that we can to answer them:
http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/
Below is the off-list e-mail that I sent to Aryeh, Philip, Henri and David.
--------------------------------
Message-ID: 4B55010B.6090607@digitalbazaar.com
Date: Mon, 18 Jan 2010 19:47:07 -0500
From: Manu Sporny msporny@digitalbazaar.com
User-Agent: Mozilla-Thunderbird 2.0.0.19 (X11/20090103)
MIME-Version: 1.0
To: Henri Sivonen hsivonen@iki.fi, Aryeh Gregor Simetrical+wikilist@gmail.com, Philip Jägenstedt philip@foolip.org
CC: David Gerard dgerard@gmail.com
Subject: [OFFLIST] Re: [Wikitech-l] RDFa and Microdata in MediaWiki
References: 95E5D943-BD84-4028-8FDA-8CBC7D4CE587@iki.fi 4B54E525.8020100@digitalbazaar.com fbad4e141001181524t4c0622d6i4689b32510c6f426@mail.gmail.com
In-Reply-To: fbad4e141001181524t4c0622d6i4689b32510c6f426@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
David Gerard wrote:
2010/1/18 Manu Sporny msporny@digitalbazaar.com:
Rather than argue the same FUD for Microdata, which anybody could, I [...]
This is FUD. You are asserting your opinion without making any sort of [...]
Please don't do this sort of thing. You were asked once already not to.
Hey Aryeh, Henri, Philip,
Looks like I've had my hand slapped twice during this discussion. I thought this was the first warning, but David seems to think differently. That means that either I've been too aggressive or David is not familiar with the level of intensity surrounding the Microdata/RDFa debates. Both of these are negative outcomes in my mind.
In the first case, know that I am not personally attacking any of you, nor do I ever intend to. I strongly disagree with the idea of getting Microdata integrated with Wikipedia at this stage, before REC, but it is the idea that I am attacking, not any of you.
If any of you took any of my statements personally, specifically Aryeh or Henri, I whole-heartedly apologize. I know each of you is doing what you feel is best for the future of the Web, just as I am.
In the second case, if David feels the need to reel the discussion in, that probably means that others in the community are not reacting well to the discussion either. This is bad for both Microdata and RDFa, as it sours those involved on using either solution.
I think I've made the points that I wanted to make and then some, so, I'm going to back off on replying to most of the discussion. I'll chime in when somebody asks a direct question of me, but will eventually exit the discussion entirely. Please do your best to fact check what you guys are saying and be careful of how you phrase assertions about the past and future RDFa work - things are never as clear cut as they may seem.
I hope this e-mail finds all of you in good health,
-- manu
PS: Tell Lachy that I hope his arm feels better soon. :)
On Wed, Jan 20, 2010 at 7:10 PM, Manu Sporny msporny@digitalbazaar.com wrote:
-- Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: Monarch - Next Generation REST Web Services
http://blog.digitalbazaar.com/2009/12/14/monarch/
(Sorry for the double-posting, wanted to make sure this got in the correct thread)
At this point, all I see is a debate between two technologies that are about equally difficult to implement in MediaWiki and provide roughly the same benefits, differing largely in the semantics of how the data is presented. In any case, I'm inclined to agree with Happy-Melon on this issue, and I think we're going about it in the wrong way.
If we've got access to this metadata, then sure, it should be distributed in as many formats as people show a desire to consume. This could be RDFa, Microdata, or anything. Right now though, we do not have this metadata. All we have is templates. Trying to extract this data from templates (or by extension, parser/tag functions) is approaching the problem from the wrong direction. It still relies on input of wikitext into the edit form. We need to remember that wikitext is a markup language designed with presentation in mind, not semantic data. This sort of page metadata (licenses, categories, etc) needs to be kept out of the edit page entirely.
-Chad
Henri Sivonen wrote:
The above could be marked up in RDFa, with pre-defined vocabs, like so:
It should be noted that the concept of "pre-defined vocabs" is neither in the HTML+RDFa draft nor in the RDFa in XHTML spec from the XHTML2 WG.
When I said "pre-defined vocabs", I was referring to "using xmlns: at the top of the document to declare prefixes that will be used in the rest of the document". I was specifically referring to technology that already exists as a W3C Recommendation.
Henri Sivonen wrote:
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg" typeof="dctype:StillImage">
<span property="dc:title">Emery Molyneux Terrestrial Globe</span>
by <a rel="cc:attributionUrl" href="http://example.org/bob/" property="cc:attributionName">Bob Smith</a>
is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
Hiding the CURIE declarations is a common pattern when advocating RDFa: It makes RDFa appear tidier than it is. To write this in RDFa in XHTML (the RDFa spec you say is safe to use for deployment), one would need to declare the CURIE prefixes:
It is a common pattern because deployment experience has shown us that prefix declaration usually happens once, at the top of a document. See the source for Digg[1], The Public Library of Science[2], and Drupal 7[3] for examples of how this is done on live sites today.
This is analogous to how web authors include scripts and CSS at the top of their documents.
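As an illustration of declaring prefixes once at the top of a document, here is a minimal sketch of an XHTML+RDFa 1.0 page built around the example quoted earlier. The DOCTYPE and version attribute are the ones defined for XHTML+RDFa 1.0. The namespace URIs bound to dc, dctype and cc are the commonly used Dublin Core and Creative Commons ones and are an assumption here, since the thread does not spell them out; the page title is made up.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<!-- Prefixes are declared once, on the root element (assumed URIs). -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:dctype="http://purl.org/dc/dcmitype/"
      xmlns:cc="http://creativecommons.org/ns#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Hypothetical image description page</title>
  </head>
  <body>
    <!-- The CURIEs below resolve against the declarations above. -->
    <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
       typeof="dctype:StillImage">
      <span property="dc:title">Emery Molyneux Terrestrial Globe</span>
      by <a rel="cc:attributionUrl" href="http://example.org/bob/"
            property="cc:attributionName">Bob Smith</a>
      is licensed under a
      <a rel="license"
         href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
        Commons Attribution-Share Alike 3.0 United States License</a>.
    </p>
  </body>
</html>
```

With the declarations on the root element, the per-paragraph markup stays as short as in the quoted example, and additional annotated paragraphs on the same page need no further declarations.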
Henri Sivonen wrote:
However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use it for deployment.
RDFa in XHTML has indeed been published as a Recommendation jointly by the Semantic Web Deployment Working Group and the XHTML2 Working Group. However, you fail to mention that even though the document mentions "HTML" in its first sentence, all the normative matter concerns strictly XHTML and the document has gone through the W3C Process as a specification that applies to XML.
RDFa was designed to work in XHTML and HTML. The RDF in XHTML Task Force, which produced XHTML1+RDFa, was only chartered to realize the language in XHTML. Had we been chartered to work on HTML5, which wasn't even an official W3C work product at the time, we would have done so.
We are currently working on ensuring that markup in both XHTML and HTML remains identical so that all XHTML1+RDFa will continue to be interpreted properly in HTML5.
Henri Sivonen wrote:
MediaWiki serves its pages as text/html and, thus, they get processed as HTML, so it would be inappropriate to rely on a spec that had been reviewed as an XML spec.
MediaWiki's pages only get processed as HTML by web browsers. The Web is more than web browsers - search engine companies, for example, often process the document based on the received DOCTYPE. While that document is served as "text/html", it is validated as XHTML 1.0 Transitional by the W3C.
It is a goal of the RDFa Working Group to ensure that any document that is XHTML1 valid, served as "text/html", produces the same triples as if it were served as "application/xhtml+xml". As a general rule, that is a goal that the current parsers meet and that we will ensure to codify in HTML5+RDFa.
Henri Sivonen wrote:
Furthermore, the ease of getting a spec to REC at the W3C depends on how many people are interested in the spec. The more people are interested in a spec, the more review comments there are. The flip side is that when there's *less* interest in a spec, it's easier to get it to Recommendation due to fewer comments raised. Thus, progress along the REC track isn't a commensurable indicator of technical merit or technical maturity across different specs and WGs.
This is a red herring - any published technology will always have detractors that don't like it and claim such things as "well, I wouldn't call that a spec because of personal opinion X.". XHTML+RDFa is a REC, and because it is a REC, we can speak with authority that the spec is not going to change and can be used as-is.
XHTML+RDFa had a great deal of review - just check the mailing list reviews and number of implementations (over 8 at last count). The fact that there are people using RDFa and so few errata for the XHTML+RDFa spec thus far is proof of the substantial review that it underwent.
Rather than argue the same FUD for Microdata, which anybody could, I suggest that we focus on technical merits.
Henri Sivonen wrote:
Also, when assessing the "safe" deployability of RDFa in XHTML, it's relevant to consider that
- RDFa in XHTML was knowingly (see http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-August/015913.html) progressed on the Recommendation track without resolving how RDFa works with HTML first.
Not true. We were very aware of HTML4, but were not chartered to work on HTML5. How RDFa would work in HTML4 was brought up and considered frequently. HTML5 was barely on the radar for most of RDFa's development. We even went as far as developing a DTD for HTML4+RDFa, but there was no avenue to publish it at W3C since all work on HTML4 had ceased.
The fact that you can point fully-conforming Javascript RDFa processors (such as ubiquity-rdfa or rdfquery) at an HTML4 or HTML5 document containing RDFa and get data out is proof that we went out of our way to ensure a universal markup mechanism for semantic data.
Keep in mind that when RDFa was being developed, HTML5 wasn't even a W3C work product. Now that it is, we have an updated HTML5+RDFa spec (which, by the way, hasn't required a single change to the RDFa processing rules).
Henri Sivonen wrote:
- An RDFa 1.1 is in the works, and the changes being considered make RDFa 1.0 look like a beta release. (Which is understandable, since a good part of the technical review of RDFa has occurred after RDFa in XHTML was rushed to REC.)
This is FUD. You are asserting your opinion without making any sort of technical argument. The larger changes being considered are feature additions. Let's stick to the technical arguments that will impact Wikipedia rather than devolve into how each of us views the ridiculously convoluted process that got RDFa and Microdata to where they are today.
Automatic XML Literals are the only thing that may not be backwards-compatible in RDFa 1.1, and Wikipedia can guard against this by ensuring that they follow this one rule:
* If you want to express any data as an XML Literal, make sure that you use datatype="rdf:XMLLiteral".
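To make that rule concrete, here is a small sketch of the two cases. The subject and the dc: properties reuse the earlier example; the prefix URIs are the standard RDF and Dublin Core namespaces, and the choice of properties and the emphasised title are purely illustrative assumptions.

```html
<div xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg">
  <!-- Markup inside the value: state the datatype explicitly so the
       object is an XML Literal regardless of processor defaults. -->
  <span property="dc:title" datatype="rdf:XMLLiteral">The
    <em>Emery Molyneux</em> Terrestrial Globe</span>

  <!-- Text-only value: a plain literal, no datatype needed. -->
  <span property="cc:attributionName"
        xmlns:cc="http://creativecommons.org/ns#">Bob Smith</span>
</div>
```

Without the explicit datatype, an RDFa 1.0 processor would still produce an XML Literal for the first span because it contains a child element; that automatic behaviour is the part that may not carry over to RDFa 1.1, which is why spelling it out is the safer choice.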
All other changes are feature additions based on community requests, some of which both Aryeh and you requested. :)
-- manu
[1] http://digg.com/politics/Barack_Obama_Officialy_Becomes_44th_American_Presid...
[2] http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000275
[3] http://drupalrdf.openspring.net/node/106
2010/1/18 Manu Sporny msporny@digitalbazaar.com:
Rather than argue the same FUD for Microdata, which anybody could, I [...]
This is FUD. You are asserting your opinion without making any sort of [...]
Please don't do this sort of thing. You were asked once already not to.
- d.