Sorry, just want to clarify:
It would be easy to get images into Google Image Search.
It would be hard to get correct licenses into Google Image Search, given
the current situation. We'd need to do some serious rethinking on our end.
On 10/13/11 1:58 PM, Neil Kandalgaonkar wrote:
On 10/13/11 1:35 PM, Rayson Ho wrote:
On Thu, Oct 13, 2011 at 3:56 PM, Neil Kandalgaonkar <neilk(a)wikimedia.org> wrote:
What exactly needs to be done??
1) Figure out some scheme whereby the actual license is available in the
database, not merely expressed in human-readable HTML. This is hard, so
it's where I gave up. Timo (Krinkle) and Roan Kattouw were working on
this for a bit but they were pulled off to do other things. To do this
right we'd create a new namespace like License: and then connect that to
some database entity. We could then connect those to existing license
templates.
2) Once that's done, from a daily cronjob or something, generate
Sitemaps (a summary of all our content) compatible with the Google Image
Search extended syntax.
http://www.google.com/support/webmasters/bin/answer.py?answer=178636
A less organized dump from my brain here:
http://www.mediawiki.org/wiki/User:NeilK/Sitemaps
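To make the two steps above a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions: the LICENSE_URLS lookup stands in for whatever structured license data step 1 would eventually produce, and the function name, record keys, and example.org URLs are all made up. The image:image, image:loc, and image:license elements and their namespace are the ones described in Google's image Sitemap extension.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Hypothetical stand-in for step 1: license template name -> canonical
# license URL, stored as data rather than scraped from rendered HTML.
LICENSE_URLS = {
    "Cc-by-sa-3.0": "http://creativecommons.org/licenses/by-sa/3.0/",
    "Cc-zero": "http://creativecommons.org/publicdomain/zero/1.0/",
}


def build_image_sitemap(records):
    """Return sitemap XML (as a string) for an iterable of image records.

    Each record is a dict with the hypothetical keys 'page_url',
    'image_url', and 'license_template'.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for rec in records:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = rec["page_url"]
        image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = rec["image_url"]
        # Emit a license only when we have structured data for it;
        # guessing would defeat the purpose.
        license_url = LICENSE_URLS.get(rec["license_template"])
        if license_url is not None:
            ET.SubElement(image, f"{{{IMAGE_NS}}}license").text = license_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_image_sitemap([
        {
            "page_url": "https://example.org/wiki/File:Example.jpg",
            "image_url": "https://example.org/images/Example.jpg",
            "license_template": "Cc-by-sa-3.0",
        },
    ]))
```

A daily cronjob could run something along these lines over a database dump and write the result where crawlers can fetch it. Images whose template has no known mapping are still listed, just without a license.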
Can't Google just parse the licensing section to decide if the image is
under CC or not??
Google *can* do any number of things. However, they will probably not do
any custom development work for Commons.
By 2011 standards, Commons is a relatively small image repository.
Flickr has billions of images, and it's not even the most popular photo
host. Facebook, although inaccessible to Google, adds several billion
images to its repository *per week*.
Commons may have some of the "best of the web" images for illustration
purposes, so it is a high-value thing for Google to crawl. So yeah, it's
enough for them to assign a guy to talk to me every few months or so.
But it's not enough for them to assign developers. If it were, they
wouldn't have even bothered pinging us.
Commons has no real way to communicate licenses to Google. Templates
create human-readable HTML, not machine-parseable legal information. If
someone edited the CC master template tomorrow to look a bit prettier,
anything that was trying to parse licenses from HTML would break.
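A toy Python illustration of that fragility. The markup and class names here are invented for the example and are not what the Commons templates actually emit; the point is only that a scraper is tied to one exact rendering.

```python
import re

# Hypothetical scraper that depends on one exact rendering of a license box.
LICENSE_PATTERN = re.compile(r'<span class="license-short">([^<]+)</span>')


def scrape_license(html):
    """Return the license name found in the HTML, or None if the markup differs."""
    match = LICENSE_PATTERN.search(html)
    return match.group(1) if match else None


# Invented "before" and "after" renderings of the same license template.
before = '<div class="cc-box"><span class="license-short">CC-BY-SA-3.0</span></div>'
after = '<div class="cc-box"><em class="license-name">CC BY-SA 3.0</em></div>'

print(scrape_license(before))
print(scrape_license(after))
```

The license itself never changed between the two renderings, only the presentation, yet the scraper finds nothing in the second one.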
Google has a standard for us to tell them the license, in the extended
Sitemap syntax for images, linked to above. That's what we should do,
because it would make that information available to Google, and
potentially to any other search engines that can read that standard.
If Google can find more of my 400+ images, then they can be used by
others more often... that would certainly make me work harder to take
more photos & upload more to Commons!
Hell yes!