On 3/4/11 4:43 PM, Platonides wrote:
> Neil Kandalgaonkar wrote:
>> So lately Google has been pinging the WMF about the lack of sitemaps
>> on Commons. If you don't know what those are, sitemaps are a way of
>> telling search engines about all the URLs that are hosted on your
>> site, so they can find them more easily, or more quickly.[1]
>
> We have traditionally had problems with images, with description pages
> being assumed to be images...
I'm not quite sure I understand you, but I think the new image sitemap
extensions from Google might help make that distinction.
> Have you tested it first? How long does it take?
On commons.prototype.wikimedia.org, which is a virtualized server, the
script can create sitemaps at the rate of about 7K pages per second.
Commons has about 13M pages. So the whole thing will probably be done in
less than an hour. It executes a separate query for each namespace, so
those will be open maybe 10 or 20 minutes. I don't like leaving queries
open that long, but this seems like it might be okay.
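For what it's worth, the back-of-the-envelope arithmetic behind that
estimate (just a sketch using the figures above; nothing here is the
actual script):

```python
# Rough check of the full-regeneration time estimate.
# Both figures come from the prototype numbers quoted above.

PAGES = 13_000_000  # approximate page count on Commons
RATE = 7_000        # pages per second observed on the prototype

seconds = PAGES / RATE
minutes = seconds / 60
print(f"~{minutes:.0f} minutes")  # comfortably under an hour
```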
Jens Frank used to be the person watching over this, and he said that
they just executed it via cronjob and it didn't harm things. Still, that
was in 2007.
If there are problems, there are IMO some obvious ways to make a more
efficient sitemap script. It's dumb to regenerate the whole set of pages
every time, especially for Commons, where content updates are rare.
> lots of apaches waiting for the file lock.

No apaches are involved; this is launched by a cronjob.
> We can start using <image:image> <image:loc> now. The other extensions
> will have to wait.
Yes.
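For reference, a URL entry using that extension would look something
like this (a sketch based on Google's image sitemap extension; the file
name is made up, and the image URL path is elided):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <!-- <loc> is the description page... -->
    <loc>http://commons.wikimedia.org/wiki/File:Example.jpg</loc>
    <image:image>
      <!-- ...while <image:loc> points at the image file itself,
           which is exactly the distinction we've had trouble with -->
      <image:loc>http://upload.wikimedia.org/.../Example.jpg</image:loc>
    </image:image>
  </url>
</urlset>
```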
Or, if we write a very efficient sitemapper that only looks at recently
changed files, then we could afford to parse their content for known
license templates. But I'd rather wait for this information to be
available in the db.
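A minimal sketch of what "only look at recently changed files" could
mean, assuming we record the timestamp of the last run somewhere. The
in-memory SQLite table and its column names are purely illustrative,
not the actual MediaWiki schema:

```python
import sqlite3

def pages_changed_since(conn, last_run):
    """Return titles of pages touched after last_run (ISO timestamp)."""
    cur = conn.execute(
        "SELECT title FROM pages WHERE touched > ? ORDER BY touched",
        (last_run,),
    )
    return [row[0] for row in cur]

# Stand-in for the wiki's page table, with made-up columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, touched TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("File:Old.jpg", "2011-01-01"), ("File:New.jpg", "2011-03-01")],
)

# Only the recently changed file needs its sitemap entry regenerated.
print(pages_changed_since(conn, "2011-02-01"))  # → ['File:New.jpg']
```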
--
Neil Kandalgaonkar <neilk(a)wikimedia.org>