Hi,
There is an interesting Firefox extension called Zemanta that works with some blogging platforms to suggest images matching the blog post you are typing. One of the sources they use is Commons. See this post (and its comments) for a description of how it works and what it's lacking: http://brianna.modernthings.org/article/97/zemanta-wikimedia-commons-for-bloggers
In particular, "If you have an idea how to correctly capture wikipedia images attribution (something that would assure at least 50% correct coverage from 2.8M images), please help us! ;)"
Really, we can't blame people too much for not providing attribution when we neither present that information in a standard way nor offer a standard way of accessing it.
Now is as good a time as any to formally write an API that we can recommend to other people. Aside from the MediaWiki API, I can think of three main tasks that often need to be automated:
* identify any "problem tags" (files with deletion markers shouldn't be used or indexed by third parties)
* extract the license name(s) and URL for a given file
* extract the author attribution string for a given file
So I propose we put our heads together and figure out the most robust algorithm for each of these, and provide some sample code for each.
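As a starting point for discussion, here is a rough Python sketch of the three tasks. It is not a proposed final algorithm. It assumes the caller already has the wikitext of the file description page (for example, fetched through the MediaWiki API) and that the page uses the {{Information}} template with an Author field. The template lists and license URLs are a small illustrative sample, and the function and variable names are made up for this example.

```python
import re

# Illustrative and deliberately incomplete: these sets are assumptions made
# for the sketch, not a catalogue of the templates Commons actually uses.
PROBLEM_TEMPLATES = {"delete", "speedydelete", "copyvio",
                     "no license", "no source", "no permission"}
LICENSE_URLS = {
    "cc-by-sa-2.5": "http://creativecommons.org/licenses/by-sa/2.5/",
    "cc-by-2.5": "http://creativecommons.org/licenses/by/2.5/",
    "gfdl": "http://www.gnu.org/copyleft/fdl.html",
}

# "{{Name" followed by "|" or "}}"; captures the template name only.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?=\||\}\})")
# The "|Author=..." line of an {{Information}} block (one line only).
AUTHOR_FIELD = re.compile(r"^[ \t]*\|[ \t]*author[ \t]*=[ \t]*(.*?)[ \t]*$",
                          re.IGNORECASE | re.MULTILINE)
# [[Target|Label]] -> Label, [[Target]] -> Target
WIKILINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def template_names(wikitext):
    """Return the lower-cased names of all templates used on the page."""
    return [re.sub(r"[\s_]+", " ", m.group(1)).strip().lower()
            for m in TEMPLATE_NAME.finditer(wikitext)]


def inspect_file_page(wikitext):
    """Collect problem tags, licenses and an attribution string."""
    names = template_names(wikitext)
    author = AUTHOR_FIELD.search(wikitext)
    attribution = WIKILINK.sub(r"\1", author.group(1)) if author else ""
    return {
        # 1. deletion or missing-information markers found on the page
        "problems": [n for n in names if n in PROBLEM_TEMPLATES],
        # 2. (license name, license URL) pairs for recognised templates
        "licenses": [(n, LICENSE_URLS[n]) for n in names if n in LICENSE_URLS],
        # 3. plain-text author string, or None if the field is missing/empty
        "attribution": attribution or None,
    }


# A made-up description page used only to exercise the sketch.
SAMPLE_PAGE = """{{delete|reason=example}}
{{Information
|Description=A test image
|Source=Own work
|Author=[[User:Example|Example]]
}}
{{cc-by-sa-2.5}}
"""

print(inspect_file_page(SAMPLE_PAGE))
```

A third party would skip any file whose "problems" list is non-empty and treat an empty "licenses" list or a missing attribution as "unknown" rather than guessing. Matching templates with regular expressions is fragile (redirected template names, nested templates, and free-text author fields will all trip it up), which is part of why a robust, agreed algorithm is worth working out.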
I made a start here:
http://commons.wikimedia.org/wiki/Commons:API
Contributions and feedback welcome...
cheers, Brianna