Thanks for doing this, Magnus. I am super busy next week, going to two
conferences, but I've scheduled some time near the end of October to
evaluate this & see if I can get it working in the cluster.
On 10/14/11 12:58 AM, Magnus Manske wrote:
On Thu, Oct 13, 2011 at 9:58 PM, Neil Kandalgaonkar <neilk(a)wikimedia.org> wrote:
Google has a standard for us to tell them the license, in the extended
Sitemap syntax for images, linked to above. That's what we should do,
because it would make that information available to Google, and
potentially to any other search engines that can read that standard.
I have created a preliminary sitemap file for Commons on the toolserver.
I use categories to find licenses, currently CC-BY-SA, CC-BY, GFDL,
and PD. This can assign 9,355,602 of our 11.3M files at least one
license. (There might be multiple entries for the same file in there,
though.) It's far from complete, but a reasonable start IMHO.
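
For readers who want to see what such an entry looks like, here is a minimal sketch in Python. It is not the Perl script mentioned in this mail. It assumes the image extension namespace and the image:image, image:loc and image:license elements from Google's image sitemap documentation of the time. The LICENSE_URLS table, the build_sitemap function and the example URLs are hypothetical names made up for illustration. The real script derives licenses from category membership in the database, which this sketch replaces with a plain lookup on a license key.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Hypothetical mapping from a license family (as found via categories)
# to a license URL; the exact URLs and versions are assumptions.
LICENSE_URLS = {
    "CC-BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
    "CC-BY": "https://creativecommons.org/licenses/by/3.0/",
    "GFDL": "https://www.gnu.org/licenses/fdl.html",
    "PD": "https://creativecommons.org/publicdomain/mark/1.0/",
}


def build_sitemap(files):
    """Build a sitemap string from (page_url, image_url, license_key) tuples.

    Files whose license key is not in LICENSE_URLS are skipped, which
    mirrors the fact that not every file can be assigned a license.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_url, license_key in files:
        license_url = LICENSE_URLS.get(license_key)
        if license_url is None:
            continue  # no known license, leave the file out
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
        ET.SubElement(image, f"{{{IMAGE_NS}}}license").text = license_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://commons.wikimedia.org/wiki/File:Example.jpg",
         "https://upload.wikimedia.org/example/Example.jpg",
         "CC-BY-SA"),
    ]))
```

A file that sits in several license categories would need one image:license value chosen for it, or several entries, which is probably where the duplicate entries mentioned above come from.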
For those with toolserver access, the file is here (300MB gzipped):
/mnt/user-store/magnus/commons.sitemap.gz
Generation took 38 minutes. Script (hereby under GFDL) is here:
/home/magnus/commons_sitemap/make_sitemap.pl (utilizing /home/magnus/sql_quick)
Magnus
_______________________________________________
Commons-l mailing list
Commons-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/commons-l