Re: [Wikitech-l] RDFa and Microdata in MediaWiki

20 Jan 2010


On 01/20/2010 04:47 PM, Happy-melon wrote:
...
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message 
news:7c2a12e21001200638y759365c8oeecd8f06f761a583@mail.gmail.com...
...
On Mon, Jan 18, 2010 at 7:34 PM, Happy-melon happy-melon@live.com wrote:
I bet very few people would bother adding metadata without a concrete
use.  And they'd probably get into fights with other people annoyed at
them for making it harder to edit wikitext.  This would all be
irrelevant if we only supported a few whitelisted vocabularies,
though, as the current microdata implementation does.  We should
encourage bulky and not-so-useful stuff to go in a separate stream.
Yes, very few people would bother.  Those few people would still introduce a 
monstrous amount of extra markup by working deep in the template stack. 
Doesn't take much to add kilobytes to large articles; I've added 5kb to 
[[Barack Obama]] myself just by adding a span round reference brackets. 
Just adding author metadata to citation templates would add seconds to load 
times for large articles.
...
...
I would say it's definitely 'worth' exposing license metadata on every use
of an image; the status of a page's images affects our whole terms of use,
whether we can say "yes you can use all this in this fashion" versus "you
have to jump through these hoops for these images because they're
different".  Author, location, capture date; yes these probably aren't
'worth' the cost of exposing on pages.  But being able to search Commons for
all photos taken in Berlin between 1989 and 1991 would be worth its weight
in gold.
Sure -- but that can be exposed in a separate data stream, since
99.9% of page views won't need it.
I'm not talking about exposing it in a data stream per se, I'm suggesting 
that that's what our internal search would be able to achieve if the 
metadata was accessible to MediaWiki.
...
...
Indeed, but that's data *output*, not input.  Currently our categories are
input via [[Category:Foo]] and output via some HTML at the bottom of the
page, but also via the API in a variety of formats; people use both methods
to extract the metadata.  Once MW knows what data an object has, how it
outputs that data back is totally open as you say.  So given that a
translation into a format that MW understands is desirable for its own sake,
and that from there it's trivial to translate back into whatever output
format(s) the current web demands, why would we choose an input format like
<span xmlns:dc="http://purl.org/dc/elements/1.1/"
href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span>
is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>
rather than an input format like [[License::CC-BY-SA-3.0]]?
First, why are you asking me why we would choose RDFa when I don't
think we should?  At least quote microdata.
Second, this is apples to oranges.  Your RDFa sample a) says that the
work is a still image, b) gives its name, c) gives the author's name,
d) gives the URL of the license, e) contains user-visible prose.  Your
wikitext sample just gives the license name (not even a license URL!).
No kidding the latter is shorter.  A more realistic comparison might
be
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span itemprop="author">Bob Smith</span> is licensed under a <a
itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
vs.
<p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]]
by [[author::Bob Smith|]] is licensed under a
[[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creativecommons.org/licenses/by-sa/3.0/us/
Creative Commons Attribution-Share Alike 3.0 United States License]]].</p>
or something, which is not such an easy call.  The wikitext is not
that much shorter or simpler -- particularly when you account for the
fact that you'd have to separately define mappings to concrete
microdata/RDFa/RDF vocabularies for output.  (Yes, I left out the
itemtype on the microdata, but again, that would have to be defined
somewhere for the wikisyntax too.)
True, the markup Dmitry offered is more suitable.  But Ryan is absolutely 
right.  You're only thinking about the *current* generation of formats, 
and assuming (maybe legitimately, I don't know) that microdata is the best 
format for us to use.  What happens when the next generation of format(s) 
comes out?  With a format-neutral input format, MW sites can quickly adapt to 
accommodate it.  Plus this method of data-injection will take much more work 
to allow MW to extract the data from the wikitext, which puts our "search 
for photos in Berlin" use case further out of reach.
You could say that we're talking about different things again; that you're 
talking about marking up data for external use.  But there's no reason why a 
{{#prop:foo|bar}} magic word can't *also* output some appropriate metadata 
format into the wikitext.  Marking up in a format-neutral syntax allows us 
to output metadata from wikitext *and* from MW generally, and to change 
*both* formats at the drop of a hat.  Marking up in a particular format, 
whatever the format is, makes it damn near impossible (or at least 
hopelessly hackish) to change wikitext output from one format to another, 
and equally horrible for MW to collect data at all.
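As a rough illustration of the format-neutral argument above, here is a minimal Python sketch, not MediaWiki code. It assumes a hypothetical mapping table (VOCAB) from neutral property names to vocabulary terms and a hypothetical render_prop helper. The microdata and RDFa terms are the ones quoted earlier in this thread. Switching output format then means changing one argument or one table, not every page.

```python
from html import escape

# Hypothetical mapping from format-neutral property names to the
# vocabulary terms quoted in this thread. Not an actual MediaWiki table.
VOCAB = {
    "title": {"microdata": "title", "rdfa": "dc:title"},
    "author": {"microdata": "author", "rdfa": "cc:attributionName"},
}

# Attribute that carries the term in each output format.
ATTRIBUTE = {"microdata": "itemprop", "rdfa": "property"}


def render_prop(name, value, fmt):
    """Render one neutral (name, value) pair as a span in the given format."""
    if fmt not in ATTRIBUTE:
        raise ValueError("unknown output format: " + fmt)
    term = VOCAB[name][fmt]
    return '<span {}="{}">{}</span>'.format(
        ATTRIBUTE[fmt], escape(term, quote=True), escape(value)
    )


if __name__ == "__main__":
    # The same input rendered in two output formats.
    for fmt in ("microdata", "rdfa"):
        print(render_prop("author", "Bob Smith", fmt))
```

Supporting a later format would, under these assumptions, only require a new column in VOCAB and a new entry in ATTRIBUTE.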
I do not like the idea of having a parser function that outputs the data
into the article - if people want the meta-data they can query it from
an API, or a dump, as opposed to screen-scraping. Perhaps meta-data on
image pages is useful, but if someone wants to get licenses of all the
images, surely providing a single file containing all is better than
screen-scraping for it (even RDFa/microdata is screen scraping, in my
opinion; it's just done with the hope that a developer has made it easy
for you - you will still have to deal with invalid uses of markup, and
the more complicated the markup, the more it will be used invalidly).
I would not be against whitelisting the necessary attributes to allow
wikis to put in these formats manually.
I do like the idea (a lot) of having a parser function that can put data
into a storage model inside MediaWiki (probably tabular, ideally
relational) that can be dumped like the current articles or queried
using the API. My original thoughts [0] had the wiki's technocrats
define a few "tables" which could be populated with the {{#store}} command.
Conrad
[0] http://en.wiktionary.org/w/index.php?oldid=6304302
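The tabular storage idea can be sketched as follows, using Python's sqlite3 module as a stand-in for whatever storage MediaWiki would actually use. The table name, column names, and the functions open_store, store and photos_taken are all hypothetical; store only mimics what a {{#store}} call might record. Dates are assumed to be ISO 8601 strings, so a plain string comparison orders them correctly.

```python
import sqlite3


def open_store():
    """Create an in-memory table standing in for a wiki-defined 'photos' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photos ("
        "page TEXT PRIMARY KEY, location TEXT, captured TEXT, license TEXT)"
    )
    return conn


def store(conn, page, location, captured, license_name):
    """Record one row, roughly what a {{#store}} call on a page might do."""
    conn.execute(
        "INSERT OR REPLACE INTO photos VALUES (?, ?, ?, ?)",
        (page, location, captured, license_name),
    )


def photos_taken(conn, location, start, end):
    """Return pages for photos taken at a location between two ISO dates."""
    rows = conn.execute(
        "SELECT page FROM photos "
        "WHERE location = ? AND captured BETWEEN ? AND ? ORDER BY page",
        (location, start, end),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_store()
    # Made-up rows, for illustration only.
    store(conn, "Example-wall.jpg", "Berlin", "1989-11-10", "CC-BY-SA-3.0")
    store(conn, "Example-tower.jpg", "Paris", "1990-06-01", "CC-BY-SA-3.0")
    print(photos_taken(conn, "Berlin", "1989-01-01", "1991-12-31"))
```

With the data held in a table like this, the Berlin 1989 to 1991 search from earlier in the thread becomes a single query, and the same rows could be dumped or served through an API without scraping page HTML.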
...
--HM

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
