So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
I investigated this issue and found that we do have a sitemaps script in maintenance, but it hasn't been enabled on the Wikipedias since 2007-12-27. In the meantime it was discovered that Google had written a custom crawler for Recent Changes, so the script was never re-enabled for them.
As for Commons: we don't have a sitemap either, but from a cursory examination of Google Image Search I don't think they are crawling our Recent Changes. Even if they were, there's more to life than Google -- we also want to be in other search engines, tools like TinEye, etc. So it would be good to have this back again.
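For anyone unfamiliar with the format, here is a minimal sketch of what a Sitemap-protocol file looks like, rendered by a small Python helper (the Commons URL is just an example; per sitemaps.org each file holds at most 50,000 URLs):

```python
# Minimal sketch of a sitemaps.org file: a <urlset> of <url>/<loc>
# entries. The example URL is illustrative, not from the real script.
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Render a list of page URLs as a sitemap XML string."""
    entries = "\n".join(
        "  <url><loc>%s</loc></url>" % escape(u) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "%s\n</urlset>\n" % entries
    )

print(build_sitemap([
    "http://commons.wikimedia.org/wiki/File:Example.jpg",
]))
```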
a) any objections, volunteers, whatever, for re-enabling the sitemaps script on Commons? This means probably just adding it back into daily cron.
b) anyone want to work on making it more efficient and/or better?
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
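The "only scans recent stuff" idea can be sketched in a few lines: keep a watermark of the last run's timestamp and only look at pages touched after it. The stubbed recentchanges list below is illustrative; the real script would query the database.

```python
# Sketch of incremental scanning: filter pages by a timestamp watermark
# instead of rescanning the whole page table. Data is stubbed out; the
# timestamp format mimics MediaWiki's YYYYMMDDHHMMSS strings, which
# compare correctly as plain strings.
def pages_changed_since(recentchanges, watermark):
    """Return titles of pages edited after the given timestamp."""
    return [rc["title"] for rc in recentchanges if rc["timestamp"] > watermark]

rc = [
    {"title": "File:Old.jpg", "timestamp": "20110301000000"},
    {"title": "File:New.jpg", "timestamp": "20110304120000"},
]
print(pages_changed_since(rc, "20110303000000"))  # → ['File:New.jpg']
```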
[1] http://sitemaps.org/ [2] http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=1786...
On 4 March 2011, Neil Kandalgaonkar wrote:
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
...
-- Neil Kandalgaonkar neilk@wikimedia.org
Bryan, Roan and me are working on this: http://www.mediawiki.org/wiki/License_integration
Right now we're mostly brainstorming about the best way to do this. We expect to plan real development within 2011, but it will most certainly take a while before it's done: stable, working, backwards compatible, with proper usability, code reviewed, and live on Commons.
-- Krinkle
On Fri, Mar 4, 2011 at 2:49 PM, Krinkle krinklemail@gmail.com wrote:
On 4 March 2011, Neil Kandalgaonkar wrote:
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
...
-- Neil Kandalgaonkar neilk@wikimedia.org
Bryan, Roan and me are working on this: http://www.mediawiki.org/wiki/License_integration
We did indeed get started on this, but during discussion on this list I found out that perhaps we were not following the proper approach.
Bryan
Neil Kandalgaonkar wrote:
So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
We have traditionally had problems with images: description pages being assumed to be the images themselves...
I investigated this issue and found that we do have a sitemaps script in maintenance, but it hasn't been enabled on the Wikipedias since 2007-12-27. In the meantime it was discovered that Google wrote some custom crawling bot for Recent Changes, so it was never re-enabled for them.
As for Commons: we don't have a sitemap either, but from a cursory examination of Google Image Search I don't think they are crawling our Recent Changes. Even if they were, there's more to life than Google -- we also want to be in other search engines, tools like TinEye, etc. So it would be good to have this back again.
a) any objections, volunteers, whatever, for re-enabling the sitemaps script on Commons? This means probably just adding it back into daily cron.
Have you tested it first? How long does it take?
b) anyone want to work on making it more efficient and/or better?
Commons has 13M pages, which means generating at least 260 sitemaps. You could do some tricks like grouping pages into sitemaps by page_id and then updating the relevant sitemap on each edit, but updating one URL among 10,000 inside a text file would leave lots of apaches waiting for the file lock. That could be overcome with some kind of journal applied to the sitemaps later, but coming full circle, that's equivalent to updating the sitemaps based on recentchanges data.
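The arithmetic behind the "at least 260 sitemaps" figure, and the page_id grouping trick, can be checked in a couple of lines (the 50,000 cap is the protocol's per-file URL limit; mapping shards by page_id is the suggestion above, not what the current script does):

```python
# With at most 50,000 URLs per sitemap file, each page's shard is just
# page_id // 50000, so a single edit dirties exactly one sitemap file.
URLS_PER_SITEMAP = 50000  # protocol maximum per sitemap file

def shard_for(page_id):
    """Sitemap shard index holding a given page, under page_id grouping."""
    return page_id // URLS_PER_SITEMAP

print(13_000_000 // URLS_PER_SITEMAP)  # → 260 files for ~13M pages
print(shard_for(1_234_567))            # → 24
```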
Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so this requires parsing through every File page, which will take several millennia.
This will not work at all with the current sitemaps script as it scans the entire database every time and regenerates a number of sitemaps files from scratch. So, what we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database).
We can start using image:image image:loc now. The other extensions will have to wait.
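Emitting those image-extension tags is straightforward; a rough sketch of one <url> entry with the image:image/image:loc nesting (the URLs are illustrative, and the xmlns:image declaration would go on the enclosing <urlset> element):

```python
# Sketch of the Google image-sitemap extension tags mentioned above:
# an <image:image>/<image:loc> pair nested inside a <url> entry.
from xml.sax.saxutils import escape

def url_entry(page_url, image_url):
    """Render one sitemap <url> entry with an image extension block."""
    return (
        "<url>\n"
        "  <loc>%s</loc>\n"
        "  <image:image>\n"
        "    <image:loc>%s</image:loc>\n"
        "  </image:image>\n"
        "</url>" % (escape(page_url), escape(image_url))
    )

print(url_entry(
    "http://commons.wikimedia.org/wiki/File:Example.jpg",
    "http://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
))
```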
On 3/4/11 4:43 PM, Platonides wrote:
Neil Kandalgaonkar wrote:
So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
We have had traditionally problems with images, description pages assumed to be images...
I'm not quite sure I understand you. But I think the new extensions from Google might help make that distinction.
Have you tested it first? How long does it take?
On commons.prototype.wikimedia.org, which is a virtualized server, the script can create sitemaps at the rate of about 7K pages per second. Commons has about 13M pages. So the whole thing will probably be done in less than an hour. It executes a separate query for each namespace, so those will be open maybe 10 or 20 minutes. I don't like leaving queries open that long, but this seems like it might be okay.
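A quick back-of-the-envelope check of that runtime claim, using the measured rate:

```python
# ~13M pages at ~7K pages/second comes out to roughly half an hour,
# consistent with the "less than an hour" estimate above.
pages = 13_000_000
rate = 7_000  # pages per second, as measured on the prototype server
seconds = pages / rate
print(round(seconds))       # → 1857
print(round(seconds / 60))  # → 31 (minutes)
```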
Jens Frank used to be the person watching over this, and he said that they just executed it by cronjob and it didn't harm things. Still, that was in 2007.
If there are problems, there are IMO some obvious ways to make a more efficient sitemap script. It's dumb to regenerate the whole set of pages every time, especially for Commons, where content updates are rare.
lots of apaches waiting for the file lock.
No apaches involved, this is launched by cronjob.
We can start using image:image image:loc now. The other extensions will have to wait.
Yes.
Or, if we write a very efficient sitemapper that only looks at recently changed files, then we could afford to parse their content for known license templates. But I'd rather wait for this to be available in the db.
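Parsing for known license templates could be as simple as scanning the wikitext for template names against an allow-list. A hypothetical sketch (the template names and the set of "known" licenses here are illustrative, not an actual Commons inventory):

```python
# Hypothetical sketch of spotting license templates in a File page's
# wikitext. KNOWN_LICENSES is an assumed allow-list, not real data.
import re

KNOWN_LICENSES = {"cc-by-sa-3.0", "cc-zero", "gfdl", "pd-self"}

def licenses_in(wikitext):
    """Return the set of known license templates used in the wikitext."""
    found = set()
    # Grab each template's name: text after "{{" up to "|" or "}}".
    for name in re.findall(r"\{\{\s*([^|}]+?)\s*(?:\||\}\})", wikitext):
        if name.lower() in KNOWN_LICENSES:
            found.add(name.lower())
    return found

print(licenses_in("== Licensing ==\n{{Cc-by-sa-3.0}}\n{{Information|...}}"))
# → {'cc-by-sa-3.0'}
```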
On Sat, Mar 5, 2011 at 02:17, Neil Kandalgaonkar neilk@wikimedia.org wrote:
We have traditionally had problems with images: description pages being assumed to be the images themselves...
I'm not quite sure I understand you. But I think the new extensions from Google might help make that distinction.
The image description page typically looks like http://commons.wikimedia.org/wiki/Image:filename.jpg
Google seemed to think that something ending with .jpg is an image and not a page.
Mathias