On 04/11/2012 01:45 AM, Erik Zachte wrote:
Here are some numbers on total bot burden:
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states for March 2012:
In total 69.5 M page requests (mime type text/html only!) per day are considered crawler requests, out of 696 M page requests (10.0%) or 469 M external page requests (14.8%). About half (35.1 M) of crawler requests come from Google.
The fraction will be larger than average (larger than 10%) for a) sites with many small pages (Wiktionary) and b) sites in languages with a smaller audience (the Swedish sites). Bots will index these pages as they find them, but each of these pages can expect fewer search hits and less human traffic than long articles (Wikipedia) in languages with many speakers (English). The bot traffic is like a constant background noise, and the human traffic is the signal on top. Sites with many small pages and a small audience will have a lower signal-to-noise ratio. The long tail of seldom-visited pages drowns in that noise.
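For anyone who wants to recheck those shares, the arithmetic is just the quoted figures divided against each other. A small Python sketch, using only the March 2012 numbers from the squid report:

    # Crawler shares recomputed from the March 2012 squid report figures
    # quoted above (page requests per day, text/html only, in millions).
    crawler = 69.5
    all_pages = 696.0      # all page requests
    external = 469.0       # external page requests only
    google = 35.1          # crawler requests coming from Google

    print(f"crawler share of all requests:      {crawler / all_pages:.1%}")  # 10.0%
    print(f"crawler share of external requests: {crawler / external:.1%}")   # 14.8%
    print(f"Google share of crawler requests:   {google / crawler:.1%}")     # ~50%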
I should disclose that I "work for the competition". I tried to add books to Wikisource, but its complexity slows me down, so I'm now focusing on my own Scandinavian book-scanning website, Project Runeberg, http://runeberg.org/
It has 700,000 scanned book pages, the same size as the English Wikisource, which is a large number of pages for a small language audience (mostly Swedish). Yesterday, April 10, its Apache access log had 291,000 hits, of which 116,000 were HTML pages, but 71,000 of those matched bot/spider/crawler, leaving only 45,000 human page views. If Swedish Wikisource, which is 1/20 that size, got 10-13 thousand human page views per day, i.e. 1/4 of that human traffic, I'd be surprised. It is more likely that the bot share there is similar: 71/116 = 61%.
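The counting behind those Runeberg numbers is nothing fancier than pattern matching on the access log. A minimal sketch of that kind of counting in Python (Apache's combined log format and the file name access.log are assumptions here, and the matching rules are only illustrative, not an exact recipe):

    import re

    # Anything whose log line mentions bot/spider/crawler (normally in the
    # User-Agent field) is counted as a bot.
    BOT_RE = re.compile(r"bot|spider|crawler", re.IGNORECASE)
    # Count a hit as an HTML page view if the request is for a .html/.htm
    # path (an assumption; a real classification could look at the response
    # Content-Type instead).
    HTML_RE = re.compile(r'"(?:GET|HEAD) \S+\.html?(?:\?\S*)? HTTP/')

    total = html = bots = 0
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            total += 1
            if HTML_RE.search(line):
                html += 1
                if BOT_RE.search(line):
                    bots += 1

    print(f"hits:             {total}")
    print(f"HTML page views:  {html}")
    print(f"bot page views:   {bots}")
    print(f"human page views: {html - bots}")
    if html:
        print(f"bot share:        {bots / html:.0%}")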
(Are we competitors? Not really. We're both liberating content. Swedish Wikipedia has more external links to runeberg.org than to any other website.)