On 01/20/2010 04:47 PM, Happy-melon wrote:
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e21001200638y759365c8oeecd8f06f761a583@mail.gmail.com...
On Mon, Jan 18, 2010 at 7:34 PM, Happy-melon happy-melon@live.com wrote:
I bet very few people would bother adding metadata without a concrete use. And they'd probably get into fights with other people annoyed at them for making it harder to edit wikitext. This would all be irrelevant if we only supported a few whitelisted vocabularies, though, as the current microdata implementation does. We should encourage bulky and not-so-useful stuff to go in a separate stream.
Yes, very few people would bother. Those few people would still introduce a monstrous amount of extra markup by working deep in the template stack. Doesn't take much to add kilobytes to large articles; I've added 5kb to [[Barack Obama]] myself just by adding a span round reference brackets. Just adding author metadata to citation templates would add seconds to load times for large articles.
I would say it's definitely 'worth' exposing license metadata on every use of an image; the status of a page's images affects our whole terms of use, whether we can say "yes you can use all this in this fashion" versus "you have to jump through these hoops for these images because they're different". Author, location, capture date; yes these probably aren't 'worth' the cost of exposing on pages. But being able to search commons for all photos taken in Berlin between 1989 and 1991 would be worth its weight in gold.
Sure -- but that can be exposed in a separate data stream, since 99.9% of page views won't need it.
I'm not talking about exposing it in a data stream per se, I'm suggesting that that's what our internal search would be able to achieve if the metadata was accessible to MediaWiki.
Indeed, but that's data *output*, not input. Currently our categories are input via [[Category:Foo]] and output via some HTML at the bottom of the page, but also via the API in a variety of formats; people use both methods to extract the metadata. Once MW knows what data an object has, how it outputs that data back is totally open as you say. So given that a translation into a format that MW understands is desirable for its own sake, and that from there it's trivial to translate back into whatever output format(s) the current web demands, why would we choose an input format like
<span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/StillImage" property="dc:title" rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image" property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>
Rather than an input format like [[License::CC-BY-SA-3.0]]??
First, why are you asking me why we would choose RDFa when I don't think we should? At least quote microdata.
Second, this is apples to oranges. Your RDFa sample a) says that the work is a still image, b) gives its name, c) gives the author's name, d) gives the URL of the license, e) contains user-visible prose. Your wikitext sample just gives the license name (not even a license URL!). No kidding the latter is shorter. A more realistic comparison might be
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span> by <span itemprop="author">Bob Smith</span> is licensed under a <a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative Commons Attribution-Share Alike 3.0 United States License</a>.</p>
vs.
<p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]] by [[author::Bob Smith|]] is licensed under a [[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creativecommons.org/licenses/by-sa/3.0/us/ Creative Commons Attribution-Share Alike 3.0 United States License]]].</p>
or something, which is not such an easy call. The wikitext is not that much shorter or simpler -- particularly when you account for the fact that you'd have to separately define mappings to concrete microdata/RDFa/RDF vocabularies for output. (Yes, I left out the itemtype on the microdata, but again, that would have to be defined somewhere for the wikisyntax too.)
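To make the comparison concrete, here is a minimal sketch of how annotations in the `[[name::value|display]]` style used in the example above might be pulled out of wikitext. The regular expression and the `extract_annotations` helper are assumptions for illustration only; this is not how MediaWiki or any extension actually parses annotations, and it ignores nesting, templates and nowiki sections.

```python
import re

# Hypothetical pattern: "[[", a property name, "::", then a value that runs
# until the first "|" or "]". Not a real wikitext parser.
ANNOTATION_RE = re.compile(r"\[\[(\w+)::([^|\]]+)")


def extract_annotations(wikitext):
    """Return (property, value) pairs found in the wikitext, in order."""
    return [
        (name.lower(), value.strip())
        for name, value in ANNOTATION_RE.findall(wikitext)
    ]


sample = (
    "[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]] by "
    "[[author::Bob Smith|]] is licensed under a "
    "[[license::http://creativecommons.org/licenses/by-sa/3.0/us/|"
    "[http://creativecommons.org/licenses/by-sa/3.0/us/ Creative Commons "
    "Attribution-Share Alike 3.0 United States License]]]."
)
pairs = extract_annotations(sample)
```

The pairs carry no vocabulary information, which is the point made above: a mapping from `title`, `author` and `license` to concrete microdata or RDFa terms would still have to be defined somewhere.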
True, the markup Dmitry offered is more suitable. But Ryan is absolutely right. You're only thinking about the *current* generation of formats, and assuming (maybe legitimately, I don't know) that microdata is the best format for us to use. What happens when the next generation of format(s) comes out? With a format-neutral input format, MW sites can quickly adapt to accommodate it. Plus this method of data-injection will take much more work to allow MW to extract the data from the wikitext, which puts our searching-for-photos-in-Berlin issue further out of reach.
You could say that we're talking about different things again; that you're talking about marking up data for external use. But there's no reason why a {{#prop:foo|bar}} magic word can't *also* output some appropriate metadata format into the wikitext. Marking up in a format-neutral syntax allows us to output metadata from wikitext *and* from MW generally, and to change *both* formats at the drop of a hat. Marking up in a particular format, whatever the format is, makes it damn near impossible (or at least hopelessly hackish) to change wikitext output from one format to another, and equally horrible for MW to collect data at all.
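A rough sketch of the idea, assuming a hypothetical `render_prop` function standing in for whatever a `{{#prop:foo|bar}} ` magic word would do on output: the same format-neutral pair is rendered as microdata, as RDFa, or as plain text depending on one setting. The `RDFA_TERMS` mapping is invented for illustration; the two terms in it are borrowed from the RDFa sample quoted earlier in the thread.

```python
from html import escape

# Hypothetical mapping from format-neutral property names to RDFa terms.
# A real wiki would have to define this somewhere; it is assumed here.
RDFA_TERMS = {
    "title": "dc:title",
    "author": "cc:attributionName",
}


def render_prop(name, value, fmt="microdata"):
    """Render one format-neutral (name, value) pair as inline HTML."""
    text = escape(value)
    if fmt == "microdata":
        return '<span itemprop="%s">%s</span>' % (escape(name), text)
    if fmt == "rdfa":
        term = RDFA_TERMS.get(name, name)
        return '<span property="%s">%s</span>' % (escape(term), text)
    if fmt == "none":
        # No inline metadata at all; the data is only available to MW itself.
        return text
    raise ValueError("unknown output format: %s" % fmt)


as_microdata = render_prop("author", "Bob Smith")
as_rdfa = render_prop("author", "Bob Smith", fmt="rdfa")
as_plain = render_prop("author", "Bob Smith", fmt="none")
```

Switching the whole site from one output format to another is then a change to `fmt` and the mapping table, not to the wikitext of every page.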
I do not like the idea of having a parser function that outputs the data into the article - if people want the meta-data they can query it from an API, or a dump, as opposed to screen-scraping. Perhaps meta-data on image pages is useful, but if someone wants to get licenses of all the images, surely providing a single file containing all is better than screen-scraping for it (even RDFa/microdata is screen scraping, in my opinion; it's just done with the hope that a developer has made it easy for you - you will still have to deal with invalid uses of markup, and the more complicated the markup, the more it will be used invalidly).
I would not be against whitelisting the necessary attributes to allow wikis to put in these formats manually.
I do like the idea (a lot) of having a parser function that can put data into a storage model inside MediaWiki (probably tabular, ideally relational) that can be dumped like the current articles or queried using the API. My original thoughts [0] had the wiki's technocrats define a few "tables" which could be populated with the {{#store}} command.
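As a very rough sketch of what that could look like (the `{{#store:table|column=value}}` argument syntax, the `photo` table and all function names below are assumptions made up for this example, not an existing MediaWiki feature), using SQLite as a stand-in for the relational store:

```python
import re
import sqlite3

# Hypothetical syntax: {{#store:table|column=value|column=value}}
STORE_RE = re.compile(r"\{\{#store:(\w+)\|([^{}]*)\}\}")


def extract_rows(wikitext):
    """Yield (table, {column: value}) for each {{#store}} call in the text."""
    for table, body in STORE_RE.findall(wikitext):
        fields = {}
        for part in body.split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip()] = value.strip()
        yield table, fields


def store_page(conn, page, wikitext):
    """Save the 'photo' rows declared on one page into the photo table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS photo (page TEXT, location TEXT, year INTEGER)"
    )
    for table, fields in extract_rows(wikitext):
        if table == "photo" and "year" in fields:
            conn.execute(
                "INSERT INTO photo VALUES (?, ?, ?)",
                (page, fields.get("location"), int(fields["year"])),
            )


def photos_in(conn, location, first_year, last_year):
    """Return page names of photos taken at location within the year range."""
    cur = conn.execute(
        "SELECT page FROM photo WHERE location = ? AND year BETWEEN ? AND ? "
        "ORDER BY page",
        (location, first_year, last_year),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
store_page(conn, "File:Wall.jpg", "{{#store:photo|location=Berlin|year=1990}}")
store_page(conn, "File:Gate.jpg", "{{#store:photo|location=Berlin|year=2005}}")
store_page(conn, "File:Tower.jpg", "{{#store:photo|location=Paris|year=1990}}")
berlin_photos = photos_in(conn, "Berlin", 1989, 1991)
```

With the data in a table, the "photos taken in Berlin between 1989 and 1991" search from earlier in the thread becomes a single query rather than a scrape of rendered pages.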
Conrad
[0] http://en.wiktionary.org/w/index.php?oldid=6304302
--HM
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l