Evan Martin wrote:
On 9/2/05, Angela <beesley@gmail.com> wrote:
Google miscategorizes the language of some of the Hungarian Wikipedia pages. E.g. it thinks that our Adolf Hitler article is in Czech.
How do you know they are miscategorising the language? http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Ahu.wikipedia.org
This makes it seem like they haven't indexed the page at all, not that they've marked it as the wrong language.
I believe this is correct. We've had problems in the past with overloading Wikipedia, so the crawl has been throttled way back.
Throttled back to what? I remember having problems in late 2003 and early 2004, but we have 30 times as much hardware now, so we could handle a proportionally higher crawl rate, say 10 or 20 requests per second. That is, unless one of the other system administrators wants to disagree.
This doesn't mean every man and his dog can try to crawl our site with wget at 20 requests per second. Our number of users is higher, and those users still have to be careful not to take any more than their fair share of the resources, or else they risk being blocked without notice. I would still recommend a crawl delay of at least 2 seconds for users without specific approval. But I think we're in a position now where we can make an exception for the big search engines.
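For anyone writing their own crawler, the 2-second delay above could be enforced with something like the following. This is only a rough sketch: the class name CrawlThrottle and its parameters are made up for illustration, and the clock and sleep functions are injectable purely so the behaviour can be checked without actually waiting.

```python
import time


class CrawlThrottle:
    """Enforce a minimum delay between consecutive requests.

    Hypothetical helper, not part of any existing crawler or library.
    """

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay   # seconds required between requests
        self._clock = clock          # monotonic time source
        self._sleep = sleep          # blocking sleep function
        self._last = None            # time at which the last wait() returned

    def wait(self):
        """Block until min_delay has elapsed since the previous call.

        Returns the number of seconds slept (0.0 if no pause was needed).
        """
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self.min_delay - elapsed)
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause
```

A crawler would call throttle.wait() immediately before each HTTP request, so that requests to the site are never closer together than min_delay seconds regardless of how quickly each response comes back.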
-- Tim Starling