Message: 7
Date: Tue, 11 Mar 2003 12:19:36 -0800 (PST)
From: Brion Vibber vibber@aludra.usc.edu
Subject: Re: [Wikitech-l] Re: what's going on with wikipedia ?
To: wikitech-l@wikipedia.org
Reply-To: wikitech-l@wikipedia.org
On Tue, 11 Mar 2003, Lee Daniel Crocker wrote:
It appears we're being Googled this morning. Googlebot is very well-behaved, and I'm not sure if that's the problem or not, but
Googlebot is fairly light (several seconds to 30 seconds between requests) and follows our robots.txt restrictions: it's getting articles, not millions of diffs or contribs pages. It's only a fraction of total pages being served. I have written them an e-mail asking if it's possible to restrict the spidering to off-peak hours, though.
-- brion vibber (brion @ pobox.com)
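As a rough illustration of the robots.txt point above: a well-behaved crawler checks each URL against the site's rules before fetching it. The sketch below uses Python's standard `urllib.robotparser`. The rules and the URLs are made up for the example (assumptions, not Wikipedia's actual robots.txt or URL layout).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: article pages are allowed, script-generated pages
# (diffs, contributions) under /w/ are blocked. Not the real robots.txt.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(HYPOTHETICAL_ROBOTS_TXT)

# Hypothetical article URL: not under /w/, so it may be crawled.
article_ok = allowed(parser, "http://www.wikipedia.org/wiki/River")

# Hypothetical diff URL: under /w/, so the crawler should skip it.
diff_ok = allowed(parser, "http://www.wikipedia.org/w/wiki.phtml?title=River&diff=1")

print(article_ok, diff_ok)
```

With rules of this shape a compliant crawler only ever requests the cheap article pages, which matches the behaviour described above.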
Well Brion, off-peak hours are a bit of a problem for an international website, aren't they? When Germany goes to lunch (12:00 CET, Central European Time), the people in San Francisco come home from the bar (3:00 AM PST, Pacific Standard Time). So I think Google cannot really do anything about it, except treat every subdomain according to its time zone. (Otherwise people in Europe will ALWAYS have a slow Wikipedia, because Google thinks that is off-peak time.)
Another idea, which might or might not work, is the Apache module mod_throttle: http://www.snert.com/Software/mod_throttle/ You could set a general minimum idle time between requests, or you could give penalties to db-heavy documents. Of course, this would make things slower still for some, but at least the server would take the load without coming close to a crash.
Cheers Leo
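The time zone arithmetic in the message above can be checked with a few lines of Python. This is only a sketch: it assumes fixed offsets (CET = UTC+1, PST = UTC-8), which hold for mid-March 2003 because neither region was on daylight saving time yet, and the date used is simply the date of this thread.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones; adequate for March 2003, before DST started.
CET = timezone(timedelta(hours=1), "CET")
PST = timezone(timedelta(hours=-8), "PST")


def convert(moment: datetime, target: timezone) -> datetime:
    """Express the same instant in another time zone."""
    return moment.astimezone(target)


lunch_in_germany = datetime(2003, 3, 12, 12, 0, tzinfo=CET)
same_moment_in_sf = convert(lunch_in_germany, PST)

print(same_moment_in_sf.strftime("%H:%M %Z"))  # 03:00 PST
```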
On Wed, 12 Mar 2003, Leonard Tulipan wrote:
Well Brion, off-peak hours are a bit of a problem for an international website, aren't they?
No, not really. Overall traffic is lower from 02:00 to 14:00 UTC than from 14:00 to 02:00. "Off-peak" isn't the same as "no one at all uses the site".
However, I've heard back from the Google folks, and they apparently have no configuration option for Googlebot to limit the times when particular domains are spidered:
Hi Brion,
Thanks for contacting Google.
We apologize for the load you are experiencing on your web servers. The crawling of websites by Googlebot is an automated process, therefore we are unable to make manual requests for Googlebot. What we can offer to do is to slow down the crawl rate for your site. If you would like us to do so, please advise and we will honor your request to do so. In the meantime, if you would prefer not to have Googlebot crawl your website, please refer to our FAQ at http://www.google.com/webmasters/3.html#B3 . We hope this information is helpful. Please contact us if you have any further questions.
Regards, The Google Team
-- brion vibber (brion @ pobox.com)
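A minimal sketch of the off-peak test discussed above, assuming the 02:00-14:00 UTC window quoted in this message. The function name and its parameters are hypothetical. The window is treated as half-open (start included, end excluded), and a window that wraps past midnight is handled as well.

```python
def is_off_peak(hour_utc: int, start: int = 2, end: int = 14) -> bool:
    """Return True if hour_utc (0-23) falls inside the window [start, end)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    if start <= end:
        # Ordinary window within one day, e.g. 02:00-14:00.
        return start <= hour_utc < end
    # Window that wraps past midnight, e.g. 14:00-02:00.
    return hour_utc >= start or hour_utc < end


# 11:00 UTC is lunchtime in Germany (12:00 CET) and falls in the quiet window.
print(is_off_peak(11))  # True
# 20:00 UTC falls in the busier half of the day.
print(is_off_peak(20))  # False
```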
Very interesting. I'd hate to slow Googlebot down on our site. We love Google, and we love that Google loves us back.
Wikitech-l mailing list Wikitech-l@wikipedia.org http://www.wikipedia.org/mailman/listinfo/wikitech-l
What about including a counter to see when the peak hours happen?
It could serve as a guide (perhaps each language's Wikipedia would have its own counter).
Regards.
----- Original Message -----
From: "Leonard Tulipan" l.tulipan@mpwi.at
To: wikitech-l@wikipedia.org
Sent: Wednesday, March 12, 2003 3:06 PM
Subject: [Wikitech-l] Re: Re: Re: what's going on with wikipedia ? (googlebot)
On Fri, 2003-03-14 at 03:23, Pedro M.V. wrote:
What about including a counter to see when the peak hours happen?
http://www.wikipedia.org/stats http://fr.wikipedia.org/stats etc
-- brion vibber (brion @ pobox.com)
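For illustration only, the kind of peak-hour counter suggested above could be sketched as follows. This is not how the stats pages linked above are produced. The function names and the sample timestamps are invented for the example, and one such counter could be kept per language wiki.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hourly_counts(timestamps):
    """Count requests per UTC hour (0-23) from timezone-aware datetimes."""
    counts = Counter()
    for ts in timestamps:
        counts[ts.astimezone(timezone.utc).hour] += 1
    return counts


def peak_hour(counts):
    """Return the UTC hour with the most requests, or None if there are none."""
    if not counts:
        return None
    return max(counts, key=counts.get)


# Made-up sample data, not real traffic.
sample_requests = [
    datetime(2003, 3, 12, 15, 5, tzinfo=timezone.utc),
    datetime(2003, 3, 12, 15, 40, tzinfo=timezone.utc),
    datetime(2003, 3, 12, 4, 10, tzinfo=timezone.utc),
    # 16:20 at UTC+1 is 15:20 UTC, so it lands in the same bucket.
    datetime(2003, 3, 12, 16, 20, tzinfo=timezone(timedelta(hours=1))),
]

counts = hourly_counts(sample_requests)
print(peak_hour(counts))  # 15
```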
I ran into a problem with Google a few days ago.
I searched for "río" (that is, "river") and received the editing page of the article. That is, some information wasn't shown on the page, and there was a risk of accidental edits (especially for newbies).
Regards.
wikitech-l@lists.wikimedia.org