[Commons-l] re-enabling sitemaps for Commons, or, why your image isn't in Google
Platonides
Platonides at gmail.com
Sat Mar 5 00:43:35 UTC 2011
Neil Kandalgaonkar wrote:
> So lately Google has been pinging the WMF about the lack of sitemaps on
> Commons. If you don't know what those are, sitemaps are a way of telling
> search engines about all the URLs that are hosted on your site, so they
> can find them more easily, or more quickly.[1]
We have traditionally had problems with images: description pages being
assumed to be the images themselves...
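For anyone who hasn't looked at the format: a sitemap is just an XML list
of <url> entries, and once you have many files you also publish a sitemap
index pointing at each one. A minimal sketch in Python, with placeholder
example.org URLs rather than our real layout:

from xml.sax.saxutils import escape

def urlset(page_urls):
    # One sitemap file: a plain <urlset> with one <url>/<loc> per page.
    entries = "\n".join("  <url><loc>%s</loc></url>" % escape(u)
                        for u in page_urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            '%s\n'
            '</urlset>\n' % entries)

def sitemap_index(sitemap_urls):
    # The index file that tells crawlers where the individual sitemaps live.
    entries = "\n".join("  <sitemap><loc>%s</loc></sitemap>" % escape(u)
                        for u in sitemap_urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            '%s\n'
            '</sitemapindex>\n' % entries)

print(urlset(["https://commons.example.org/wiki/File:Example.jpg"]))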
> I investigated this issue and found that we do have a sitemaps script in
> maintenance, but it hasn't been enabled on the Wikipedias since
> 2007-12-27. In the meantime it was discovered that Google wrote some
> custom crawling bot for Recent Changes, so it was never re-enabled for them.
>
> As for Commons: we don't have a sitemap either, but from a cursory
> examination of Google Image Search I don't think they are crawling our
> Recent Changes. Even if they were, there's more to life than Google --
> we also want to be in other search engines, tools like TinEye, etc. So
> it would be good to have this back again.
>
> a) any objections, volunteers, whatever, for re-enabling the sitemaps
> script on Commons? This means probably just adding it back into daily cron.
Have you tested it first? How long does it take?
> b) anyone want to work on making it more efficient and/or better?
Commons has 13M pages, and a single sitemap file can hold at most 50,000
URLs, so that means generating at least 260 sitemaps.
You could do some tricks, grouping pages into sitemaps by page_id and then
updating the relevant sitemap whenever a page changes, but rewriting your
URL among the 10,000 others inside a text file would leave lots of Apache
processes waiting on the file lock. That could be overcome with some kind
of journal applied to the sitemaps later, but coming full circle, that's
equivalent to updating the sitemaps based on recentchanges data.
> Google has introduced some nifty extensions to the Sitemap protocol,
> including geocoding and (especially dear to our hearts) licensing![2]
> However we don't have such information easily available in the database,
> so this requires parsing through every File page, which will take
> several millennia.
>
> This will not work at all with the current sitemaps script as it scans
> the entire database every time and regenerates a number of sitemaps
> files from scratch. So, what we need is something more iterative, that
> only scans recent stuff. (Or, using such extensions will have to wait
> until someone brings licensing into the database).
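Something in that direction could look like the sketch below. This is only
an outline: db.query and regenerate_sitemap() are hypothetical helpers,
bucket_for() is the grouping from the earlier sketch, and the recentchanges
column names are from memory.

import time

def incremental_update(db, last_run_ts):
    # Find the pages touched since the last run via recentchanges.
    rows = db.query(
        "SELECT DISTINCT rc_cur_id FROM recentchanges"
        " WHERE rc_timestamp > %s", (last_run_ts,))
    # Regenerate only the sitemap files whose bucket contains a changed
    # page, instead of rescanning all 13M rows of the page table.
    dirty_buckets = {bucket_for(row[0]) for row in rows}
    for bucket in sorted(dirty_buckets):
        regenerate_sitemap(db, bucket)   # rewrites one <=50,000-URL file
    # Remember when we ran, so the next cron invocation starts from here.
    return time.strftime("%Y%m%d%H%M%S", time.gmtime())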
We can start using <image:image> <image:loc> now.
The other extensions will have to wait.
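For reference, the image extension is just an extra namespace plus a nested
block per image. A sketch of one entry (placeholder URLs; <image:license>
and friends are the part that would need licensing data we don't have in
the database):

# The enclosing <urlset> must also declare
# xmlns:image="http://www.google.com/schemas/sitemap-image/1.1".
def image_url_entry(page_url, image_url):
    # One <url> entry carrying only the parts we can fill in today:
    # the description page URL and the raw image URL.
    return ("  <url>\n"
            "    <loc>%s</loc>\n"
            "    <image:image>\n"
            "      <image:loc>%s</image:loc>\n"
            "    </image:image>\n"
            "  </url>\n" % (page_url, image_url))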