On 3/4/11 4:43 PM, Platonides wrote:
> Neil Kandalgaonkar wrote:
>> So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
> We have traditionally had problems with images: description pages assumed to be images...
I'm not quite sure I understand you. But I think the new extensions from Google might help make that distinction.
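To make that concrete, here's a rough sketch of the shape of an entry using Google's image extension (this is not what our script emits today, and the URLs are made-up examples). The point is that <loc> carries the description page while <image:image>/<image:loc> carries the actual media file, which is exactly the distinction we've been unable to express. Built with Python's stdlib just to show the structure:

# Sketch only: one sitemap <url> entry with the Google image extension.
# The File: page and upload URL below are made-up examples.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)

urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
# <loc> points at the description page...
ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = (
    "http://commons.wikimedia.org/wiki/File:Example.jpg")
# ...while <image:image>/<image:loc> points at the media file itself.
image = ET.SubElement(url, "{%s}image" % IMAGE_NS)
ET.SubElement(image, "{%s}loc" % IMAGE_NS).text = (
    "http://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg")

print(ET.tostring(urlset, encoding="unicode"))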
> Have you tested it first? How long does it take?
On commons.prototype.wikimedia.org, which is a virtualized server, the script can create sitemaps at a rate of about 7K pages per second. Commons has about 13M pages, so the whole thing will probably be done in less than an hour. It executes a separate query for each namespace, so those will be open for maybe 10 or 20 minutes. I don't like leaving queries open that long, but this seems like it might be okay.
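For the record, the arithmetic behind "less than an hour":

# Back-of-the-envelope check: ~13M pages at ~7K pages/sec on the prototype box.
pages = 13000000
pages_per_second = 7000
seconds = pages / float(pages_per_second)
print("%.0f seconds (~%.0f minutes)" % (seconds, seconds / 60))
# roughly 1857 seconds, i.e. about 31 minutes, comfortably under an hour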
Jens Frank used to be the person watching over this, and he said they just executed it from a cronjob and it didn't harm things. Still, that was in 2007.
If there are problems, there are IMO some obvious ways to make the sitemap script more efficient. It's dumb to regenerate the whole set of pages every time, especially for Commons, where content updates are rare.
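Hand-wavy sketch of what I mean, assuming we keep a high-water-mark timestamp between cron runs (the SQL is simplified and the helper below is illustrative, not code from the actual script):

# Incremental idea: only look at pages touched since the previous cron run
# instead of walking all ~13M pages. Assumes a DB-API connection to the wiki
# database; page_touched is the MediaWiki column I have in mind (it also bumps
# on re-renders, which is harmless here).
import time

def pages_changed_since(db, last_run_ts):
    """Yield (namespace, title) for pages touched after last_run_ts,
    a MediaWiki-style YYYYMMDDHHMMSS timestamp string."""
    cursor = db.cursor()
    cursor.execute(
        "SELECT page_namespace, page_title FROM page"
        " WHERE page_touched > %s ORDER BY page_namespace",
        (last_run_ts,))
    for namespace, title in cursor:
        yield namespace, title

# The cron job would regenerate only the sitemap files containing these pages,
# then record a new high-water mark for the next run:
new_high_water_mark = time.strftime("%Y%m%d%H%M%S", time.gmtime())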
> lots of apaches waiting for the file lock.
No apaches involved; this is launched by cronjob.
> We can start using image:image and image:loc now. The other extensions will have to wait.
Yes.
Or, if we write a very efficient sitemapper that only looks at recently changed files, we could afford to parse their content for known license templates. But I'd rather wait for this to be available in the db.
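If we did go down that road, I imagine something vaguely like this: a template-name-to-license-URL table that could feed an element like image:license. Purely illustrative; the mapping below is neither real nor complete:

# Very rough sketch of the "known license templates" idea. The template names
# and URLs here are just examples, not an actual mapping we'd ship.
import re

LICENSE_TEMPLATES = {
    "cc-by-sa-3.0": "http://creativecommons.org/licenses/by-sa/3.0/",
    "cc-zero": "http://creativecommons.org/publicdomain/zero/1.0/",
}
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+?)\s*[|}]")

def licenses_in(wikitext):
    """Return license URLs for any known license templates found in the text."""
    found = set()
    for name in TEMPLATE_RE.findall(wikitext):
        url = LICENSE_TEMPLATES.get(name.strip().lower())
        if url:
            found.add(url)
    return found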