Sorry, just want to clarify:
It would be easy to get images into Google Image Search.
It would be hard to get correct licenses into Google Image Search, given the current situation. We'd need to do some serious rethinking on our end.
On 10/13/11 1:58 PM, Neil Kandalgaonkar wrote:
On 10/13/11 1:35 PM, Rayson Ho wrote:
On Thu, Oct 13, 2011 at 3:56 PM, Neil Kandalgaonkar <neilk@wikimedia.org> wrote:
What exactly needs to be done??
- Figure out some scheme whereby the actual license is available in the database, not merely expressed in human-readable HTML. This is hard, so it's where I gave up. Timo (Krinkle) and Roan Kattouw were working on this for a bit, but they were pulled off to do other things. To do this right, we'd create a new namespace like License: and then connect that to some database entity. We could then connect those to existing license templates.
- Once that's done, from a daily cronjob or something, generate Sitemaps (a summary of all our content) compatible with the Google Image Search extended syntax.
http://www.google.com/support/webmasters/bin/answer.py?answer=178636
A less organized dump from my brain here: http://www.mediawiki.org/wiki/User:NeilK/Sitemaps
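To make the two steps above a bit more concrete, here is a rough Python sketch, not a worked-out design. The LICENSES dict is a hypothetical stand-in for the License: entities (a real version would live in the database), the build_image_sitemap function and the example.org URLs are made up for illustration, and the only parts taken from the standard are the sitemap namespaces and the image:loc / image:license elements.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Hypothetical stand-in for the "License:" entities:
# license template name -> canonical license URL.
LICENSES = {
    "Cc-by-sa-3.0": "http://creativecommons.org/licenses/by-sa/3.0/",
    "Cc-zero": "http://creativecommons.org/publicdomain/zero/1.0/",
}


def build_image_sitemap(files):
    """Return sitemap XML for (page_url, image_url, license_template) tuples."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_url, template in files:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
        # Only emit a license when we actually know it; never guess.
        license_url = LICENSES.get(template)
        if license_url is not None:
            ET.SubElement(image, f"{{{IMAGE_NS}}}license").text = license_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_image_sitemap([
        ("http://example.org/wiki/File:A.jpg",
         "http://example.org/images/A.jpg",
         "Cc-by-sa-3.0"),
    ]))
```

The point of the lookup table is that the sitemap generator never touches rendered HTML: it only needs a reliable mapping from a file to a license record, which is exactly the part we don't have yet.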
Can't Google just parse the licensing section to decide if the image is under CC or not??
Google *can* do any number of things. However, they will probably not do any custom development work for Commons.
By 2011 standards, Commons is a relatively small image repository. Flickr has billions of images, and it's not even the most popular photo host. Facebook, although inaccessible to Google, adds several billion images to its repository *per week*.
Commons may have some of the "best of the web" images for illustration purposes, so it is a high-value thing for Google to crawl. So yeah, it's enough for them to assign a guy to talk to me every few months or so. But not enough that they will assign developers. They wouldn't have even bothered pinging us if that were the case.
Commons has no real way to communicate licenses to Google. Templates create human-readable HTML, not machine-parseable legal information. If someone edited the CC master template tomorrow to look a bit prettier, anything that was trying to parse licenses from HTML would break.
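A toy Python illustration of that breakage, under made-up assumptions: both HTML snippets and the scrape_license helper are hypothetical, not what the CC templates actually render or what any crawler actually runs.

```python
import re

LICENSE_URL = "http://creativecommons.org/licenses/by-sa/3.0/"

# Hypothetical markup a license template might render today...
OLD_HTML = (
    '<div class="licensetpl">'
    f'<a href="{LICENSE_URL}">CC BY-SA 3.0</a></div>'
)
# ...and the same license after someone makes the template prettier.
NEW_HTML = (
    '<table class="license-box"><tr><td>'
    f'<a rel="license" href="{LICENSE_URL}">CC BY-SA 3.0</a>'
    '</td></tr></table>'
)

# A scraper written against the old markup.
SCRAPER = re.compile(r'<div class="licensetpl"><a href="([^"]+)">')


def scrape_license(html):
    """Return the license URL if the expected markup is present, else None."""
    match = SCRAPER.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(scrape_license(OLD_HTML))  # finds the license URL
    print(scrape_license(NEW_HTML))  # None: same license, different markup
```

The license did not change between the two snippets, only the presentation did, and the scraper silently loses it.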
Google has a standard for us to tell them the license, in the extended Sitemap syntax for images, linked to above. That's what we should do, because it would make that information available to Google, and potentially to any other search engines that can read that standard.
If Google can find more of my 400+ images, then they can be used by others more often... that would certainly make me work harder to take more photos & upload more to Commons!
Hell yes!