So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
I investigated this issue and found that we do have a sitemaps script in maintenance, but it hasn't been enabled on the Wikipedias since 2007-12-27. In the meantime it was discovered that Google had written a custom crawler for Recent Changes, so the script was never re-enabled for them.
As for Commons: we don't have a sitemap either, but from a cursory examination of Google Image Search I don't think they are crawling our Recent Changes. Even if they were, there's more to life than Google -- we also want to be in other search engines, tools like TinEye, etc. So it would be good to have this back again.
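For anyone unfamiliar with the format, here is a minimal sketch of what a Sitemap-protocol file looks like, rendered by a small Python helper (the Commons URL is just an example; per sitemaps.org each file holds at most 50,000 URLs):

```python
# Minimal sketch of a sitemaps.org file: a <urlset> of <url>/<loc>
# entries. The example URL is illustrative, not from the real script.
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Render a list of page URLs as a sitemap XML string."""
    entries = "\n".join(
        "  <url><loc>%s</loc></url>" % escape(u) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "%s\n</urlset>\n" % entries
    )

print(build_sitemap([
    "http://commons.wikimedia.org/wiki/File:Example.jpg",
]))
```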
a) any objections, volunteers, whatever, for re-enabling the sitemaps script on Commons? This means probably just adding it back into daily cron.
b) anyone want to work on making it more efficient and/or better?
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
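The "only scans recent stuff" idea can be sketched in a few lines: keep a watermark of the last run's timestamp and only look at pages touched after it. The stubbed recentchanges list below is illustrative; the real script would query the database.

```python
# Sketch of incremental scanning: filter pages by a timestamp watermark
# instead of rescanning the whole page table. Data is stubbed out; the
# timestamp format mimics MediaWiki's YYYYMMDDHHMMSS strings, which
# compare correctly as plain strings.
def pages_changed_since(recentchanges, watermark):
    """Return titles of pages edited after the given timestamp."""
    return [rc["title"] for rc in recentchanges if rc["timestamp"] > watermark]

rc = [
    {"title": "File:Old.jpg", "timestamp": "20110301000000"},
    {"title": "File:New.jpg", "timestamp": "20110304120000"},
]
print(pages_changed_since(rc, "20110303000000"))  # → ['File:New.jpg']
```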
[1] http://sitemaps.org/ [2] http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=1786...
On 4 March 2011, Neil Kandalgaonkar wrote:
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
...
-- Neil Kandalgaonkar neilk@wikimedia.org
Bryan, Roan and me are working on this: http://www.mediawiki.org/wiki/License_integration
Right now we're mostly brainstorming about the best way to do this. We expect to plan real development within 2011, but it will most certainly take a while before it's done: stable, working, backwards compatible, with proper usability, code reviewed, and live on Commons.
-- Krinkle
On Fri, Mar 4, 2011 at 2:49 PM, Krinkle krinklemail@gmail.com wrote:
On 4 March 2011, Neil Kandalgaonkar wrote:
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
...
-- Neil Kandalgaonkar neilk@wikimedia.org
Bryan, Roan and me are working on this: http://www.mediawiki.org/wiki/License_integration
We did indeed get started on this, but during discussion on this list I found out that perhaps we were not following the proper approach.
Bryan
Neil Kandalgaonkar wrote:
So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
We have traditionally had problems with images: description pages being assumed to be the images themselves...
I investigated this issue and found that we do have a sitemaps script in maintenance, but it hasn't been enabled on the Wikipedias since 2007-12-27. In the meantime it was discovered that Google wrote some custom crawling bot for Recent Changes, so it was never re-enabled for them.
As for Commons: we don't have a sitemap either, but from a cursory examination of Google Image Search I don't think they are crawling our Recent Changes. Even if they were, there's more to life than Google -- we also want to be in other search engines, tools like TinEye, etc. So it would be good to have this back again.
a) any objections, volunteers, whatever, for re-enabling the sitemaps script on Commons? This means probably just adding it back into daily cron.
Have you tested it first? How long does it take?
b) anyone want to work on making it more efficient and/or better?
Commons has 13M pages, which means generating at least 260 sitemaps. You could do some tricks like grouping pages into sitemaps by page_id and then updating the relevant sitemap on each edit, but updating one URL among 10,000 inside a text file would leave lots of apaches waiting for the file lock. That could be overcome with some kind of journal applied to the sitemaps later, but coming full circle, that's equivalent to updating the sitemaps based on recentchanges data.
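The arithmetic behind the "at least 260 sitemaps" figure, and the page_id grouping trick, can be checked in a couple of lines (the 50,000 cap is the protocol's per-file URL limit; mapping shards by page_id is the suggestion above, not what the current script does):

```python
# With at most 50,000 URLs per sitemap file, each page's shard is just
# page_id // 50000, so a single edit dirties exactly one sitemap file.
URLS_PER_SITEMAP = 50000  # protocol maximum per sitemap file

def shard_for(page_id):
    """Sitemap shard index holding a given page, under page_id grouping."""
    return page_id // URLS_PER_SITEMAP

print(13_000_000 // URLS_PER_SITEMAP)  # → 260 files for ~13M pages
print(shard_for(1_234_567))            # → 24
```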
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
We can start using image:image image:loc now. The other extensions will have to wait.
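Emitting those image-extension tags is straightforward; a rough sketch of one <url> entry with the image:image/image:loc nesting (the URLs are illustrative, and the xmlns:image declaration would go on the enclosing <urlset> element):

```python
# Sketch of the Google image-sitemap extension tags mentioned above:
# an <image:image>/<image:loc> pair nested inside a <url> entry.
from xml.sax.saxutils import escape

def url_entry(page_url, image_url):
    """Render one sitemap <url> entry with an image extension block."""
    return (
        "<url>\n"
        "  <loc>%s</loc>\n"
        "  <image:image>\n"
        "    <image:loc>%s</image:loc>\n"
        "  </image:image>\n"
        "</url>" % (escape(page_url), escape(image_url))
    )

print(url_entry(
    "http://commons.wikimedia.org/wiki/File:Example.jpg",
    "http://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
))
```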
On 3/4/11 4:43 PM, Platonides wrote:
Neil Kandalgaonkar wrote:
So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
We have had traditionally problems with images, description pages assumed to be images...
I'm not quite sure I understand you. But I think the new extensions from Google might help make that distinction.
Have you tested it first? How long does it take?
On commons.prototype.wikimedia.org, which is a virtualized server, the script can create sitemaps at the rate of about 7K pages per second. Commons has about 13M pages. So the whole thing will probably be done in less than an hour. It executes a separate query for each namespace, so those will be open maybe 10 or 20 minutes. I don't like leaving queries open that long, but this seems like it might be okay.
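A quick back-of-the-envelope check of that runtime claim, using the measured rate:

```python
# ~13M pages at ~7K pages/second comes out to roughly half an hour,
# consistent with the "less than an hour" estimate above.
pages = 13_000_000
rate = 7_000  # pages per second, as measured on the prototype server
seconds = pages / rate
print(round(seconds))       # → 1857
print(round(seconds / 60))  # → 31 (minutes)
```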
Jens Frank used to be the person watching over this, and he said that they just executed it by cronjob and it didn't harm things. Still, that was in 2007.
If there are problems, there are IMO some obvious ways to make a more efficient sitemap script. It's dumb to regenerate the whole set of pages every time, especially for Commons, where content updates are rare.
lots of apaches waiting for the file lock.
No apaches involved, this is launched by cronjob.
We can start using image:image image:loc now. The other extensions will have to wait.
Yes.
Or, if we write a very efficient sitemapper that only looks at recently changed files, then we could afford to parse their content for known license templates. But I'd rather wait for this to be available in the db.
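Parsing for known license templates could be as simple as scanning the wikitext for template names against an allow-list. A hypothetical sketch (the template names and the set of "known" licenses here are illustrative, not an actual Commons inventory):

```python
# Hypothetical sketch of spotting license templates in a File page's
# wikitext. KNOWN_LICENSES is an assumed allow-list, not real data.
import re

KNOWN_LICENSES = {"cc-by-sa-3.0", "cc-zero", "gfdl", "pd-self"}

def licenses_in(wikitext):
    """Return the set of known license templates used in the wikitext."""
    found = set()
    # Grab each template's name: text after "{{" up to "|" or "}}".
    for name in re.findall(r"\{\{\s*([^|}]+?)\s*(?:\||\}\})", wikitext):
        if name.lower() in KNOWN_LICENSES:
            found.add(name.lower())
    return found

print(licenses_in("== Licensing ==\n{{Cc-by-sa-3.0}}\n{{Information|...}}"))
# → {'cc-by-sa-3.0'}
```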
On Sat, Mar 5, 2011 at 02:17, Neil Kandalgaonkar neilk@wikimedia.org wrote:
We have traditionally had problems with images: description pages being assumed to be the images themselves...
I'm not quite sure I understand you. But I think the new extensions from Google might help make that distinction.
The image description page typically looks like http://commons.wikimedia.org/wiki/Image:filename.jpg
Google seemed to think that something ending with .jpg is an image and not a page.
Mathias