Re: [Wikitech-l] RDFa and Microdata in MediaWiki

20 Jan 2010

"Aryeh Gregor" <Simetrical+wikilist(a)gmail.com> wrote in message 
news:7c2a12e21001200638y759365c8oeecd8f06f761a583@mail.gmail.com...
...
  On Mon, Jan 18, 2010 at 7:34 PM, Happy-melon <happy-melon(a)live.com> wrote:

 I bet very few people would bother adding metadata without a concrete
 use.  And they'd probably get into fights with other people annoyed at
 them for making it harder to edit wikitext.  This would all be
 irrelevant if we only supported a few whitelisted vocabularies,
 though, as the current microdata implementation does.  We should
 encourage bulky and not-so-useful stuff to go in a separate stream. 
Yes, very few people would bother.  Those few people would still introduce a 
monstrous amount of extra markup by working deep in the template stack. 
Doesn't take much to add kilobytes to large articles; I've added 5kb to 
[[Barack Obama]] myself just by adding a span round reference brackets. 
Just adding author metadata to citation templates would add seconds to load 
times for large articles.
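
To put rough numbers on that kind of overhead, here is a back-of-envelope sketch. The function name, the class name and the counts are all made up for illustration; none of them is measured from a real article.

```python
def markup_overhead_bytes(open_tag: str, close_tag: str, uses: int) -> int:
    """Bytes added to the page when a wrapper element is emitted `uses` times."""
    per_use = len((open_tag + close_tag).encode("utf-8"))
    return per_use * uses


# Hypothetical figures: 200 citations, each with two brackets wrapped in a span.
# The class name "ref-bracket" is invented for this example.
extra = markup_overhead_bytes('<span class="ref-bracket">', "</span>", 200 * 2)
print(f"{extra} extra bytes ({extra / 1024:.1f} KiB)")
```

The point is only that the cost scales with the number of template invocations, so a small wrapper deep in the template stack multiplies quickly.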

...
 I would say it's definitely 'worth' exposing license metadata on every use
 of an image; the status of a page's images affects our whole terms of use,
 whether we can say "yes you can use all this in this fashion" versus "you
 have to jump through these hoops for these images because they're
 different".  Author, location, capture date; yes these probably aren't
 'worth' the cost of exposing on pages.  But being able to search Commons
 for all photos taken in Berlin between 1989 and 1991 would be worth its
 weight in gold. 
 Sure -- but that can be exposed in a separate data stream, since
>99.9% of page views won't need it. 
I'm not talking about exposing it in a data stream per se, I'm suggesting 
that that's what our internal search would be able to achieve if the 
metadata was accessible to MediaWiki.
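
As a sketch of the kind of query I mean, assuming the metadata had already been extracted into records MediaWiki can see (the `Photo` record, its fields and the sample file names are hypothetical, not an existing MediaWiki structure):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Photo:
    filename: str
    location: str
    taken: date


def photos_taken_in(photos, location, first_year, last_year):
    """Return photos from `location` taken between the two years, inclusive."""
    return [
        p for p in photos
        if p.location == location and first_year <= p.taken.year <= last_year
    ]


# Invented sample records, for illustration only.
catalogue = [
    Photo("Example-wall-1989.jpg", "Berlin", date(1989, 11, 10)),
    Photo("Example-gate-1995.jpg", "Berlin", date(1995, 6, 1)),
    Photo("Example-bridge-1990.jpg", "Prague", date(1990, 3, 12)),
]
berlin = photos_taken_in(catalogue, "Berlin", 1989, 1991)
```

A real implementation would presumably be a database query rather than a list scan; the sketch only shows that the search is trivial once the data is structured.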

...
 Indeed, but that's data *output*, not input.  Currently our categories are
 input via [[Category:Foo]] and output via some HTML at the bottom of the
 page, but also via the API in a variety of formats; people use both methods
 to extract the metadata.  Once MW knows what data an object has, how it
 outputs that data back is totally open as you say.  So given that a
 translation into a format that MW understands is desirable for its own sake,
 and that from there it's trivial to translate back into whatever output
 format(s) the current web demands, why would we choose an input format like

 <span xmlns:dc="http://purl.org/dc/elements/1.1/"
 href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
 rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
 by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
 property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span>
 is licensed under a <a rel="license"
 href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creat…
 Commons Attribution-Share Alike 3.0 United States License</a>

 Rather than an input format like [[License::CC-BY-SA-3.0]]?? 
 First, why are you asking me why we would choose RDFa when I don't
 think we should?  At least quote microdata.

 Second, this is apples to oranges.  Your RDFa sample a) says that the
 work is a still image, b) gives its name, c) gives the author's name,
 d) gives the URL of the license, e) contains user-visible prose.  Your
 wikitext sample just gives the license name (not even a license URL!).
 No kidding the latter is shorter.  A more realistic comparison might
 be

 <p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
 by <span itemprop="author">Bob Smith</span> is licensed under a
 <a itemprop="license"
 href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creat…
 Commons Attribution-Share Alike 3.0 United States License</a>.</p>

 vs.

 <p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]]
 by [[author::Bob Smith|]] is licensed under a

[[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creative…
 Creative
 Commons Attribution-Share Alike 3.0 United States License]]].</p>

 or something, which is not such an easy call.  The wikitext is not
 that much shorter or simpler -- particularly when you account for the
 fact that you'd have to separately define mappings to concrete
 microdata/RDFa/RDF vocabularies for output.  (Yes, I left out the
 itemtype on the microdata, but again, that would have to be defined
 somewhere for the wikisyntax too.) 
True, the markup Dmitry offered is more suitable.  But Ryan is absolutely 
right.  You're only thinking about the *current* generation of formats, 
and assuming (maybe legitimately, I don't know) that microdata is the best 
format for us to use.  What happens when the next generation of format(s) 
comes out?  With a format-neutral input format, MW sites can quickly adapt to 
accommodate it.  Plus this method of data-injection will take much more work 
to allow MW to extract the data from the wikitext, which puts our 
searching-for-photos-in-Berlin issue further out of reach.

You could say that we're talking about different things again; that you're 
talking about marking up data for external use.  But there's no reason why a 
{{#prop:foo|bar}} magic word can't *also* output some appropriate metadata 
format into the wikitext.  Marking up in a format-neutral syntax allows us 
to output metadata from wikitext *and* from MW generally, and to change 
*both* formats at the drop of a hat.  Marking up in a particular format, 
whatever the format is, makes it damn near impossible (or at least 
hopelessly hackish) to change wikitext output from one format to another, 
and equally horrible for MW to collect data at all.
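
A minimal sketch of the idea, assuming a [[property::value]] style annotation like the ones quoted above.  The regular expression, the function names and the RDFa term mapping are assumptions for illustration; this is not how MediaWiki or any existing extension does it.

```python
import re
from html import escape

# Matches [[property::value]] or [[property::value|label]]; the label is ignored here.
ANNOTATION = re.compile(r"\[\[(\w+)::([^\]|]+)(?:\|([^\]]*))?\]\]")

# Hypothetical mapping from neutral property names to RDFa terms.
RDFA_TERMS = {
    "title": "dc:title",
    "author": "cc:attributionName",
    "license": "cc:license",
}


def extract(wikitext):
    """Return the (property, value) pairs, so MW itself can index them."""
    return [(m.group(1).lower(), m.group(2).strip())
            for m in ANNOTATION.finditer(wikitext)]


def to_microdata(prop, value):
    return f'<span itemprop="{escape(prop)}">{escape(value)}</span>'


def to_rdfa(prop, value):
    term = RDFA_TERMS.get(prop, prop)
    return f'<span property="{escape(term)}">{escape(value)}</span>'


# Supporting a new output format means adding one entry here.
RENDERERS = {"microdata": to_microdata, "rdfa": to_rdfa}


def render(wikitext, fmt):
    """Replace every annotation with markup in the chosen output format."""
    renderer = RENDERERS[fmt]
    return ANNOTATION.sub(
        lambda m: renderer(m.group(1).lower(), m.group(2).strip()), wikitext)


sample = "Photo by [[author::Bob Smith]] under [[License::CC-BY-SA-3.0]]"
```

The same annotations feed both `extract` (data collection) and `render` (page output), and switching the output format is a one-argument change rather than an edit to every article.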

--HM
