On 10/13/2011 5:02 PM, David Gerard wrote:
On 13 October 2011 21:58, Neil
Kandalgaonkar<neilk(a)wikimedia.org> wrote:
> Commons has no real way to communicate licenses to Google. Templates
> create human-readable HTML, not machine-parseable legal information. If
> someone edited the CC master template tomorrow to look a bit prettier,
> anything that was trying to parse licenses from HTML would break.
I'm going to say that this is B.S. Everybody in business seems to
think it's easy to write GUI applications (where you really spend
four months rewriting the requirements again and again, and doing
testing that never ends) and hard to write screen scrapers (where
you sometimes get it working in four minutes).
I built a rather complicated system that reads the wiki markup
and extracts a whole bunch of metadata. It was fairly accurate,
but eventually it reached a plateau in what it could do. It constantly
had trouble extracting licenses, because templates are wrapped inside
templates, which are wrapped inside templates, and so on.
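A minimal sketch of why nesting defeats naive extraction (the markup and helper names below are illustrative, not taken from my actual system): a non-greedy regex stops at the first closing braces and truncates the nested template, while a brace-matching scan recovers the whole thing.

```python
import re

# Illustrative Commons-style markup: a license template nested
# inside an information template.
markup = "{{Information|permission={{self|cc-by-sa-3.0|GFDL}}}}"

# A non-greedy regex stops at the FIRST "}}" it sees, so the
# captured span is cut off in the middle of the outer template.
naive = re.search(r"\{\{.*?\}\}", markup).group()
print(naive)  # truncated: ends at the inner template's "}}"

def outer_template(text, start):
    """Return the template beginning at `start`, with balanced {{ }} pairs."""
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    raise ValueError("unbalanced template")

print(outer_template(markup, 0))  # full outer template, nesting intact
```

And this only handles one level of the problem; real template extraction also has to expand what the inner templates mean, which is where the plateau came from.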
The old system often had to deal with contradictory data -- for
instance, there's a certain guy who uses {self} templates on photos
that came from Flickr. Nobody really noticed there was a problem
here, because the HTML markup looks superficially O.K. The issue is
that the HTML output on Commons is tested every day, but the ability
to get semantics out of the inner markup doesn't get tested. "Fifth
wheel" features (microformats, etc.) are even more likely to break
without being noticed, since nobody actually uses them...
Later on I developed a much simpler heuristic: extract all
hyperlinks from the HTML and filter for links that point to licenses.
For instance,
http://commons.wikimedia.org/wiki/File:2011-03-09-fort-du-lomont-10.jpg
has a link to
http://creativecommons.org/licenses/by/3.0/deed.en
This is as easy to read as any kind of structured metadata could
ever be. And it's not a "fifth wheel"; it's actually visible in the
HTML markup, so if it's wrong, people will notice.
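The heuristic could be sketched roughly like this (the function names and the license-URL list are illustrative assumptions, not my actual code): collect every href in the rendered HTML, then keep only the ones that point at known license URLs.

```python
from html.parser import HTMLParser

# Hypothetical prefix list -- a real one would cover more licenses.
LICENSE_PREFIXES = (
    "http://creativecommons.org/licenses/",
    "http://creativecommons.org/publicdomain/",
    "http://www.gnu.org/copyleft/fdl.html",
)

class LinkCollector(HTMLParser):
    """Accumulate the href of every <a> tag in the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_licenses(html):
    """Return the license URLs found among a page's hyperlinks."""
    parser = LinkCollector()
    parser.feed(html)
    return [u for u in parser.links if u.startswith(LICENSE_PREFIXES)]

# Made-up fragment of a Commons file page for demonstration.
page = '''<p>This file is licensed under the
<a href="http://creativecommons.org/licenses/by/3.0/deed.en">CC BY 3.0</a>
license. <a href="/wiki/Commons:Licensing">More info</a>.</p>'''

print(extract_licenses(page))
# → ['http://creativecommons.org/licenses/by/3.0/deed.en']
```

Because this reads the rendered HTML rather than the wiki markup, it doesn't care how deeply the templates are nested -- it only cares what links end up on the page, which is exactly what a human reader sees.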