[Commons-l] re-enabling sitemaps for Commons, or, why your image isn't in Google
Platonides
Platonides at gmail.com
Sat Mar 5 00:43:35 UTC 2011
Neil Kandalgaonkar wrote:
> So lately Google has been pinging the WMF about the lack of sitemaps on
> Commons. If you don't know what those are, sitemaps are a way of telling
> search engines about all the URLs that are hosted on your site, so they
> can find them more easily, or more quickly.[1]
We have traditionally had problems with images: description pages being
assumed to be the images themselves...
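For anyone who hasn't looked at the format: a sitemap is just an XML list
of <url> entries, and once you have many files you also publish a sitemap
index pointing at each one. A minimal sketch in Python, with placeholder
example.org URLs rather than our real layout:

from xml.sax.saxutils import escape

def urlset(page_urls):
    # One sitemap file: a plain <urlset> with one <url>/<loc> per page.
    entries = "\n".join("  <url><loc>%s</loc></url>" % escape(u)
                        for u in page_urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            '%s\n'
            '</urlset>\n' % entries)

def sitemap_index(sitemap_urls):
    # The index file that tells crawlers where the individual sitemaps live.
    entries = "\n".join("  <sitemap><loc>%s</loc></sitemap>" % escape(u)
                        for u in sitemap_urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            '%s\n'
            '</sitemapindex>\n' % entries)

print(urlset(["https://commons.example.org/wiki/File:Example.jpg"]))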
> I investigated this issue and found that we do have a sitemaps script in
> maintenance, but it hasn't been enabled on the Wikipedias since
> 2007-12-27. In the meantime it was discovered that Google wrote some
> custom crawling bot for Recent Changes, so it was never re-enabled for them.
>
> As for Commons: we don't have a sitemap either, but from a cursory
> examination of Google Image Search I don't think they are crawling our
> Recent Changes. Even if they were, there's more to life than Google --
> we also want to be in other search engines, tools like TinEye, etc. So
> it would be good to have this back again.
>
> a) any objections, volunteers, whatever, for re-enabling the sitemaps
> script on Commons? This means probably just adding it back into daily cron.
Have you tested it first? How long does it take?
> b) anyone want to work on making it more efficient and/or better?
Commons has 13M pages, and a single sitemap file can hold at most 50,000
URLs, so that means generating at least 260 sitemaps.
You could do some tricks, grouping pages into sitemaps by page_id and then
updating the relevant sitemap whenever a page changes, but rewriting your
URL among the 10,000 others inside a text file would leave lots of Apache
processes waiting on the file lock. That could be overcome with some kind
of journal applied to the sitemaps later, but coming full circle, that's
equivalent to updating the sitemaps based on recentchanges data.
> Google has introduced some nifty extensions to the Sitemap protocol,
> including geocoding and (especially dear to our hearts) licensing![2]
> However we don't have such information easily available in the database,
> so this requires parsing through every File page, which will take
> several millennia.
>
> This will not work at all with the current sitemaps script as it scans
> the entire database every time and regenerates a number of sitemaps
> files from scratch. So, what we need is something more iterative, that
> only scans recent stuff. (Or, using such extensions will have to wait
> until someone brings licensing into the database).
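Something in that direction could look like the sketch below. This is only
an outline: db.query and regenerate_sitemap() are hypothetical helpers,
bucket_for() is the grouping from the earlier sketch, and the recentchanges
column names are from memory.

import time

def incremental_update(db, last_run_ts):
    # Find the pages touched since the last run via recentchanges.
    rows = db.query(
        "SELECT DISTINCT rc_cur_id FROM recentchanges"
        " WHERE rc_timestamp > %s", (last_run_ts,))
    # Regenerate only the sitemap files whose bucket contains a changed
    # page, instead of rescanning all 13M rows of the page table.
    dirty_buckets = {bucket_for(row[0]) for row in rows}
    for bucket in sorted(dirty_buckets):
        regenerate_sitemap(db, bucket)   # rewrites one <=50,000-URL file
    # Remember when we ran, so the next cron invocation starts from here.
    return time.strftime("%Y%m%d%H%M%S", time.gmtime())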
We can start using <image:image> <image:loc> now.
The other extensions will have to wait.
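For reference, the image extension is just an extra namespace plus a nested
block per image. A sketch of one entry (placeholder URLs; <image:license>
and friends are the part that would need licensing data we don't have in
the database):

# The enclosing <urlset> must also declare
# xmlns:image="http://www.google.com/schemas/sitemap-image/1.1".
def image_url_entry(page_url, image_url):
    # One <url> entry carrying only the parts we can fill in today:
    # the description page URL and the raw image URL.
    return ("  <url>\n"
            "    <loc>%s</loc>\n"
            "    <image:image>\n"
            "      <image:loc>%s</image:loc>\n"
            "    </image:image>\n"
            "  </url>\n" % (page_url, image_url))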