On 10/13/2011 5:02 PM, David Gerard wrote:
On 13 October 2011 21:58, Neil
Kandalgaonkar<neilk(a)wikimedia.org> wrote:
> Commons has no real way to communicate licenses to Google. Templates
> create human-readable HTML, not machine-parseable legal information. If
> someone edited the CC master template tomorrow to look a bit prettier,
> anything that was trying to parse licenses from HTML would break.
I'm going to say that this is B.S. Everybody in business seems to
think it's easy to write GUI applications (where you really spend
four months rewriting the requirements again and again, and doing
testing that never ends) and hard to write screen scrapers (where
you sometimes get it working in four minutes).
I built a rather complicated system that reads the wiki markup
and extracts a whole bunch of metadata. It was fairly accurate,
but eventually it reached a plateau in what it could do. It constantly
had trouble extracting licenses, because templates are wrapped inside
templates, which are wrapped inside templates, and so on.
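A minimal sketch of why nesting defeats naive extraction (the markup and helper names below are illustrative, not taken from my actual system): a non-greedy regex stops at the first closing braces and truncates the nested template, while a brace-matching scan recovers the whole thing.

```python
import re

# Illustrative Commons-style markup: a license template nested
# inside an information template.
markup = "{{Information|permission={{self|cc-by-sa-3.0|GFDL}}}}"

# A non-greedy regex stops at the FIRST "}}" it sees, so the
# captured span is cut off in the middle of the outer template.
naive = re.search(r"\{\{.*?\}\}", markup).group()
print(naive)  # truncated: ends at the inner template's "}}"

def outer_template(text, start):
    """Return the template beginning at `start`, with balanced {{ }} pairs."""
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    raise ValueError("unbalanced template")

print(outer_template(markup, 0))  # full outer template, nesting intact
```

And this only handles one level of the problem; real template extraction also has to expand what the inner templates mean, which is where the plateau came from.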
The old system often had to deal with contradictory data -- for
instance, there's a certain guy who uses {self} templates on photos
that came from Flickr. Nobody really noticed there was a problem
here, because the HTML markup looks superficially O.K. The issue is
that the HTML output on Commons is tested every day, but the ability
to get semantics out of the inner markup doesn't get tested. "Fifth
wheel" features (microformats, etc.) are even more likely to break
without being noticed, since nobody actually uses them...
Later on I developed a much simpler heuristic: extract all
hyperlinks from the HTML and filter for links that point to licenses.
For instance,
http://commons.wikimedia.org/wiki/File:2011-03-09-fort-du-lomont-10.jpg
has a link to
http://creativecommons.org/licenses/by/3.0/deed.en
This is as easy to read as any kind of structured metadata could
ever be. And it's not a "fifth wheel"; it's actually visible in the
HTML markup, so if it's wrong, people will notice.
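The heuristic could be sketched roughly like this (the function names and the license-URL list are illustrative assumptions, not my actual code): collect every href in the rendered HTML, then keep only the ones that point at known license URLs.

```python
from html.parser import HTMLParser

# Hypothetical prefix list -- a real one would cover more licenses.
LICENSE_PREFIXES = (
    "http://creativecommons.org/licenses/",
    "http://creativecommons.org/publicdomain/",
    "http://www.gnu.org/copyleft/fdl.html",
)

class LinkCollector(HTMLParser):
    """Accumulate the href of every <a> tag in the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_licenses(html):
    """Return the license URLs found among a page's hyperlinks."""
    parser = LinkCollector()
    parser.feed(html)
    return [u for u in parser.links if u.startswith(LICENSE_PREFIXES)]

# Made-up fragment of a Commons file page for demonstration.
page = '''<p>This file is licensed under the
<a href="http://creativecommons.org/licenses/by/3.0/deed.en">CC BY 3.0</a>
license. <a href="/wiki/Commons:Licensing">More info</a>.</p>'''

print(extract_licenses(page))
# → ['http://creativecommons.org/licenses/by/3.0/deed.en']
```

Because this reads the rendered HTML rather than the wiki markup, it doesn't care how deeply the templates are nested -- it only cares what links end up on the page, which is exactly what a human reader sees.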