On Thu, Oct 13, 2011 at 9:58 PM, Neil Kandalgaonkar <neilk(a)wikimedia.org> wrote:
Google has a standard for us to tell them the license, in the extended
Sitemap syntax for images, linked to above. That's what we should do,
because it would make that information available to Google, and
potentially to any other search engines that can read that standard.
I have created a preliminary sitemap file for Commons on the toolserver.
I use categories to find licenses, currently CC-BY-SA, CC-BY, GFDL,
and PD. This can assign 9,355,602 of our 11.3M files at least one
license. (There might be multiple entries for the same file in there,
though.) It's far from complete, but a reasonable start IMHO.
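
For illustration only, here is a minimal Python sketch of how one entry in such a file could be built. It is not the Perl script mentioned below. The category-matching rule, the license URLs, the helper names (licenses_for, url_entry), and the example file are all assumptions made for this sketch; only the <image:image>, <image:loc>, and <image:license> tags come from Google's image sitemap extension.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)

# Assumed mapping from license family to a license URL (illustrative values).
LICENSE_URLS = {
    "CC-BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
    "CC-BY": "https://creativecommons.org/licenses/by/3.0/",
    "GFDL": "https://www.gnu.org/licenses/fdl.html",
    "PD": "https://creativecommons.org/publicdomain/mark/1.0/",
}


def licenses_for(categories):
    """Return license URLs implied by a file's category names (assumed rule)."""
    found = []
    # Try longer keys first so CC-BY-SA-3.0 is not matched as CC-BY.
    keys = sorted(LICENSE_URLS, key=len, reverse=True)
    for cat in categories:
        name = cat.upper().replace("_", "-")
        for key in keys:
            if name == key or name.startswith(key + "-"):
                if LICENSE_URLS[key] not in found:
                    found.append(LICENSE_URLS[key])
                break
    return found


def url_entry(page_url, image_url, license_url):
    """Build one <url> element carrying an image and its license."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
    ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    ET.SubElement(image, f"{{{IMAGE_NS}}}license").text = license_url
    return url


if __name__ == "__main__":
    # Hypothetical file and categories, for demonstration only.
    cats = ["CC-BY-SA-3.0", "Photographs"]
    entry = url_entry(
        "https://commons.wikimedia.org/wiki/File:Example.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
        licenses_for(cats)[0],
    )
    print(ET.tostring(entry, encoding="unicode"))
```

This sketch emits a single license per image. A file whose categories imply several licenses would need either one of them chosen or one entry per license, which may be one source of the duplicate entries mentioned above.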
For those with toolserver access, the file is here (300MB gzipped):
/mnt/user-store/magnus/commons.sitemap.gz
Generation took 38 minutes. Script (hereby under GFDL) is here:
/home/magnus/commons_sitemap/make_sitemap.pl (utilizing /home/magnus/sql_quick)
Magnus