On Thu, Oct 13, 2011 at 9:58 PM, Neil Kandalgaonkar <neilk@wikimedia.org> wrote:
Google has a standard for us to tell them the license, in the extended Sitemap syntax for images, linked to above. That's what we should do, because it would make that information available to Google, and potentially to any other search engines that can read that standard.
I have created a preliminary sitemap file for Commons on the toolserver.
I use categories to determine licenses, currently CC-BY-SA, CC-BY, GFDL, and PD. This assigns at least one license to 9,355,602 of our 11.3M files. (There might be multiple entries for the same file in there, though.) It's far from complete, but a reasonable start IMHO.
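For illustration only, here is a minimal Python sketch of how per-file license information could be written out in the image Sitemap extension. It is not the make_sitemap.pl script mentioned below. The license URLs (including the license versions), the build_sitemap helper, and the input tuple layout are all assumptions made up for this example. The sketch also skips repeated entries for the same file, keeping the first license seen, which is a simplification.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Assumed mapping from license label to a license URL.
# The specific versions and URLs are illustrative, not taken from the script.
LICENSE_URLS = {
    "CC-BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
    "CC-BY": "https://creativecommons.org/licenses/by/3.0/",
    "GFDL": "https://www.gnu.org/licenses/fdl.html",
    "PD": "https://creativecommons.org/publicdomain/mark/1.0/",
}


def build_sitemap(files):
    """Build sitemap XML from (page_url, image_url, license_label) tuples.

    Hypothetical helper: entries with an unknown label are skipped, and
    only the first entry per page URL is kept.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    seen = set()
    for page_url, image_url, label in files:
        license_url = LICENSE_URLS.get(label)
        if license_url is None or page_url in seen:
            continue
        seen.add(page_url)
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
        ET.SubElement(image, f"{{{IMAGE_NS}}}license").text = license_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    example = [
        ("https://commons.wikimedia.org/wiki/File:Example.jpg",
         "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
         "CC-BY-SA"),
    ]
    print(build_sitemap(example))
```

For a file list this large, the output would presumably have to be streamed and split across several sitemap files rather than built in memory as done here.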
For those with toolserver access, the file is here (300MB gzipped): /mnt/user-store/magnus/commons.sitemap.gz
Generation took 38 minutes. The script (hereby under GFDL) is here: /home/magnus/commons_sitemap/make_sitemap.pl (it uses /home/magnus/sql_quick)
Magnus