Here are some numbers on total bot burden:
1) http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states for March 2012:
In total, 69.5 M page requests (mime type text/html only!) per day are considered crawler requests, out of 696 M total page requests (10.0%), or 469 M external page requests (14.8%). About half of the crawler requests (35.1 M) come from Google.
2) Here are counts from a one-day log, as a sanity check:
zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' | grep -P '/wiki/|index.php' | grep -cP ' - |text/html' => 678325
zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' | grep -P '/wiki/|index.php' | grep -P ' - |text/html' | grep -ciP 'bot|crawler|spider' => 68027
68027 / 678325 = 10.0%, which matches the numbers from SquidReportCrawlers.htm really well.
---
My suggestion for how to filter these bots efficiently in a C program (no costly, nuanced regexps) before sending data to webstatscollector:
a) Find the 14th field in the space-delimited log line = user agent (but beware of false delimiters in logs from varnish, if still applicable)
b) Search this field case-insensitively for bot/crawler/spider/http (by convention only bots have a URL in the agent string)
That will filter out most bot pollution. We still want those records in the sampled log though. A rough sketch of such a check follows below.
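A minimal sketch in C of what I have in mind (assumptions on my part, not existing webstatscollector code: log lines arrive one per line on stdin, the user agent starts at the 14th space-delimited field and runs to the end of the line, and non-bot lines are passed through on stdout; the exact field position may differ for varnish logs):

/* Sketch only: filter out likely bot requests before they reach the collector. */
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strncasecmp */

/* case-insensitive substring search */
static int contains_ci(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    for (; *haystack; haystack++)
        if (strncasecmp(haystack, needle, nlen) == 0)
            return 1;
    return 0;
}

/* return 1 if the user agent (field 14 onward) looks like a bot */
static int is_bot(const char *line)
{
    const char *p = line;
    int field = 1;

    /* advance to the start of the 14th space-delimited field */
    while (*p && field < 14) {
        if (*p == ' ')
            field++;
        p++;
    }
    if (field < 14)
        return 0;   /* short/malformed line: keep it */

    /* plain substring checks, no regexp engine needed */
    return contains_ci(p, "bot")
        || contains_ci(p, "crawler")
        || contains_ci(p, "spider")
        || contains_ci(p, "http");   /* URL in agent string usually means a bot */
}

int main(void)
{
    char line[8192];
    while (fgets(line, sizeof line, stdin))
        if (!is_bot(line))
            fputs(line, stdout);   /* forward non-bot lines only */
    return 0;
}

Substring matching like this stays cheap compared to running a regexp per line, and the 'http' marker also catches crawlers that only put a contact URL in the agent string.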
Any thoughts?
Erik Zachte
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of emijrp
Sent: Sunday, April 08, 2012 9:21 PM
To: Wikimedia developers
Cc: Diederik van Liere; Lars Aronsson
Subject: Re: [Wikitech-l] Page views
2012/4/8 Erik Zachte <ezachte@wikimedia.org>
Hi Lars,
You have a point here, especially for smaller projects:
For Swedish Wikisource:
zcat sampled-1000.log-20120404.gz | grep 'GET http://sv.wikisource.org' | awk '{print $9, $11,$14}'
returns 20 lines from this 1:1000 sampled squid log file. After removing javascript/json/robots.txt there are 13 left, which fits well with 10,000 to 13,000 per day,
however 9 of these are bots!!
How many of that 1:1000 sampled log were robots (including all languages)?
--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT http://code.google.com/p/avbot/ | StatMediaWiki http://statmediawiki.forja.rediris.es | WikiEvidens http://code.google.com/p/wikievidens/ | WikiPapers http://wikipapers.referata.com | WikiTeam http://code.google.com/p/wikiteam/
Personal website: https://sites.google.com/site/emijrp/