Just something that came to my mind... Google caches the wikipedia pages just like we were doing. Are you blocking Google as well?
-----Original Message----- From: Webmaster [mailto:webmaster@tiosam.com] Sent: Tuesday, January 30, 2007 12:27 PM To: 'Wikimedia developers' Subject: RE: [Wikitech-l] Our IP was blocked by mistake
Thanks. That explains it. Also, the IP shown is not enciclopedia.tiosam.com nor ebaita.com: it is www.tiosam.com, where I just included the English version a couple of weeks ago (http://www.tiosam.com/Ingles/encyclopedia). The load was probably Google indexing the pages in English, not what we already have cached for the Portuguese version. Anyway, I apologize for my ignorance. I'm downloading the dump and will start coding as soon as I find out how it works. Thanks guys.
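For readers wondering how the dump "works": a minimal sketch of streaming pages out of a MediaWiki XML export, assuming the usual page/title/revision/text layout. The sample document, the export namespace version, and the helper names are illustrative assumptions, not taken from this thread.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs without loading the whole dump."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's children


# Tiny hypothetical sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <revision><text>Hello, ''world''.</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a compressed dump, the same function should accept the file object returned by `bz2.open(path, "rb")`; the path itself depends on which dump was downloaded.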
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Ivan Krstic Sent: Tuesday, January 30, 2007 11:40 AM To: Wikimedia developers Subject: Re: [Wikitech-l] Our IP was blocked by mistake
Tim Starling wrote:
No faked user agent string? So I suppose you were using "save as" in IE? Mozilla/4.0%20(compatible;%20MSIE%207.0;%20Windows%20NT%205.2;%20.NET%20CLR%201.1.4322;%20.NET%20CLR%202.0.50727)
No, he was using the Microsoft.XMLHTTP object from ASP, as he indicated in a previous message. Said object identifies itself as MSIE and gives the .NET CLR version in the User-Agent.
-- Ivan Krstić krstic@solarsail.hcs.harvard.edu | GPG: 0x147C722D
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
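As an illustration of the User-Agent point above, the sketch below decodes the percent-encoded string from the log and builds (but does not send) a request carrying a descriptive agent instead. The bot name and contact address are hypothetical, and this is Python rather than the ASP object discussed in the thread.

```python
from urllib.parse import unquote
from urllib.request import Request

# The agent string as it appeared, percent-encoded, in the quoted log line.
LOGGED_UA = ("Mozilla/4.0%20(compatible;%20MSIE%207.0;%20Windows%20NT%205.2;"
             "%20.NET%20CLR%201.1.4322;%20.NET%20CLR%202.0.50727)")


def decode_logged_agent(raw):
    """Turn the %20-encoded log entry back into a readable header value."""
    return unquote(raw)


def build_request(url, agent):
    """Create a request object that identifies the client explicitly."""
    return Request(url, headers={"User-Agent": agent})


# Hypothetical identifier; a real one should name the operator and a contact.
DESCRIPTIVE_UA = "ExampleMirrorBot/0.1 (webmaster@example.org)"
req = build_request("http://en.wikipedia.org/wiki/Main_Page", DESCRIPTIVE_UA)
```

A descriptive agent makes it easier for server operators to tell an automated client from a browser and to reach its owner before resorting to a block.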
No -- I don't know the particulars, but I imagine Google does not cache pages in a single burst that puts such a huge load on the server. If that were the case, most large sites (MySpace, LiveJournal, eBay) would have blocked Google, and it would be much less useful.
Mark
On 30/01/07, Webmaster webmaster@tiosam.com wrote:
Just something that came to my mind... Google caches the wikipedia pages just like we were doing. Are you blocking Google as well?
Google indexes and caches around 20,000 pages a day from our website, which I consider a small- to medium-traffic website. They do offer an option for slower indexing speed if the load hurts the server, though. However, we all know Google indexing and caching our pages is a necessary evil. Our site? Not so much :D
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Mark Williamson Sent: Tuesday, January 30, 2007 2:42 PM To: Wikimedia developers Subject: Re: [Wikitech-l] FW: Our IP was blocked by mistake
No -- I don't know the particulars, but I imagine Google does not cache pages in a single burst that puts such a huge load on the server. If that were the case, most large sites (MySpace, LiveJournal, eBay) would have blocked Google, and it would be much less useful.
Mark
-- Refije dirije lanmè yo paske nou posede pwòp bato.
On 30/01/07, Mark Williamson node.ue@gmail.com wrote:
No -- I don't know the particulars, but I imagine Google does not cache pages in a single burst that puts such a huge load on the server. If that were the case, most large sites (MySpace, LiveJournal, eBay) would have blocked Google, and it would be much less useful.
Wikimedia allows spidering at a certain pace - it's crawling very fast that isn't allowed.
Wikimedia does sell live feeds as a service. (E.g. I think Answers.com uses this.)
Most mirrors use the dumps. You can see from the download page how slow it is backing up en:wp :-)
- d.
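To make "a certain pace" concrete, here is a minimal sketch of a client-side throttle that enforces a minimum gap between requests. The class name and the one-second interval are assumptions for illustration; the thread does not state what pace Wikimedia actually tolerates.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the pacing can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to honour the interval; return time slept."""
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last = now
        return waited


# Hypothetical usage: call throttle.wait() before every page fetch.
throttle = Throttle(min_interval=1.0)
```

Calling `wait()` before each fetch keeps a script from bursting, whatever the speed of the loop around it.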
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Wikipedia limits crawling to 1 page per second in its robots.txt
Webmaster wrote:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Wikipedia limits crawling to 1 page per second in its robots.txt
Not anymore... http://en.wikipedia.org/robots.txt has:
## *at least* 1 second please. preferably more :D
## we're disabling this experimentally 11-09-2006
#Crawl-delay: 1
Matthew Flaschen
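A small sketch of how a crawler could check this itself, using Python's standard `urllib.robotparser`. The `User-agent` and `Disallow` lines wrapped around the quoted excerpt are assumptions added so the parser has a complete record, and the fallback delay is an arbitrary choice.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_text, agent, fallback=1.0):
    """Return the Crawl-delay for agent, or fallback if none is declared."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    delay = parser.crawl_delay(agent)
    return fallback if delay is None else float(delay)


# The quoted excerpt, with hypothetical User-agent/Disallow lines around it.
DISABLED = """\
User-agent: *
## *at least* 1 second please. preferably more :D
## we're disabling this experimentally 11-09-2006
#Crawl-delay: 1
Disallow: /w/
"""

# The same file with the directive uncommented, for comparison.
ENABLED = DISABLED.replace("#Crawl-delay", "Crawl-delay")

print(crawl_delay_for(DISABLED, "ExampleBot", fallback=2.0))
print(crawl_delay_for(ENABLED, "ExampleBot"))
```

With the directive commented out the parser reports no delay, so a polite client has to fall back on its own conservative default.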
I see... Can you please unblock our IP so we can download the dumps directly to the server? I removed the updating part of the script already.
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Tim Starling Sent: Tuesday, January 30, 2007 2:56 PM To: wikitech-l@lists.wikimedia.org Subject: Re: [Wikitech-l] FW: Our IP was blocked by mistake
Webmaster wrote:
Just something that came to my mind... Google caches the wikipedia pages just like we were doing. Are you blocking Google as well?
Of course not, it's our biggest referrer.
-- Tim Starling