[Commons-l] re-enabling sitemaps for Commons, or, why your image isn't in Google
Neil Kandalgaonkar
neilk at wikimedia.org
Fri Mar 4 03:31:31 UTC 2011
So lately Google has been pinging the WMF about the lack of sitemaps on
Commons. If you don't know what those are, sitemaps are a way of telling
search engines about all the URLs that are hosted on your site, so they
can find them more easily, or more quickly.[1]
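For anyone who hasn't seen one, here is a rough sketch of what generating a sitemap file looks like. The element names (urlset, url, loc, lastmod) and the namespace come from the sitemaps.org protocol; the function name build_sitemap and the example File: URL are made up for illustration, and this is not how our maintenance script does it.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (url, lastmod) pairs.

    lastmod is a W3C date string such as "2011-03-04", or None to omit it.
    build_sitemap is a hypothetical helper, not part of MediaWiki.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        if lastmod:
            ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Illustrative URL only.
    print(build_sitemap([
        ("http://commons.wikimedia.org/wiki/File:Example.jpg", "2011-03-04"),
    ]))
```

A search engine fetches this file and learns about every listed URL without having to discover it by following links.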
I investigated this issue and found that we do have a sitemaps script in
maintenance, but it hasn't been enabled on the Wikipedias since
2007-12-27. In the meantime it was discovered that Google wrote some
custom crawling bot for Recent Changes, so it was never re-enabled for them.
As for Commons: we don't have a sitemap either, but from a cursory
examination of Google Image Search I don't think they are crawling our
Recent Changes. Even if they were, there's more to life than Google --
we also want to be in other search engines, tools like TinEye, etc. So
it would be good to have this back again.
a) any objections, volunteers, whatever, for re-enabling the sitemaps
script on Commons? This means probably just adding it back into daily cron.
b) anyone want to work on making it more efficient and/or better?
Google has introduced some nifty extensions to the Sitemap protocol,
including geocoding and (especially dear to our hearts) licensing![2]
However we don't have such information easily available in the database,
so this requires parsing through every File page, which will take
several millennia.
This will not work at all with the current sitemaps script, as it scans
the entire database every time and regenerates all of the sitemap
files from scratch. So what we need is something more incremental that
only scans recent changes. (Otherwise, using such extensions will have
to wait until someone brings licensing into the database.)
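To make the "only scan recent stuff" idea a bit more concrete, here is one possible sketch. It assumes pages have integer IDs, that each sitemap file covers a fixed range of IDs, and that we can get the IDs of pages changed since the last run from somewhere (e.g. Recent Changes). All names and the file naming scheme are hypothetical; the only hard fact is the protocol's limit of 50,000 URLs per sitemap file.

```python
# The sitemap protocol allows at most 50,000 URLs per sitemap file.
PAGES_PER_FILE = 50000


def bucket_for(page_id, size=PAGES_PER_FILE):
    """Map a page ID to the index of the sitemap file that covers it."""
    return page_id // size


def plan_regeneration(changed_page_ids, size=PAGES_PER_FILE):
    """Return {filename: (first_id, last_id)} for files that need rebuilding.

    Only files whose ID range contains at least one changed page are listed;
    every other sitemap file is left untouched on disk.
    The "sitemap-NNNNN.xml" naming is an assumption for illustration.
    """
    stale = sorted({bucket_for(pid, size) for pid in changed_page_ids})
    return {
        "sitemap-%05d.xml" % b: (b * size, (b + 1) * size - 1)
        for b in stale
    }


if __name__ == "__main__":
    # Three pages changed since the last run; only two files get rebuilt.
    for name, id_range in plan_regeneration([3, 7, 120001]).items():
        print(name, id_range)
```

With this layout a daily run touches only the handful of files containing changed pages, so any expensive per-page work (like parsing a File page for its license) would be limited to those ranges instead of the whole database.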
[1] http://sitemaps.org/
[2] http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=178636
--
Neil Kandalgaonkar <neilk at wikimedia.org>