Jeremy,
Some background:
So we are talking about search engine crawlers here, right?
Here are the most active crawlers:
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm
For Google there is a special page with more depth:
http://stats.wikimedia.org/wikimedia/squids/SquidReportGoogle.htm
It's been a long-standing request to filter crawler data from page views.
We almost did it a year ago, and planned to have two sets of counts in Domas' files (one with crawlers included, one without).
I'm not sure what got in the way. Diederik can tell you more about that, and about the current status.
Filtering out crawlers would cut our page views by about 20%.
The test we planned to implement is pretty simple: check the 'user agent' field for 'crawler', 'spider', 'bot' or 'http', and reject the request if any of them occurs.
Note that the user agent string is completely unregulated, but an informal rule is to include a URL only on crawler requests.
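A minimal sketch of that test in Python, in case it helps. The names is_crawler and count_views are hypothetical, and case-insensitive matching and treating an empty user agent as non-crawler are my assumptions, not something we settled on. Plain substring matching is crude and may misclassify some agents.

```python
import re

# The four markers named above. Case-insensitive matching is an assumption.
CRAWLER_PATTERN = re.compile(r"crawler|spider|bot|http", re.IGNORECASE)


def is_crawler(user_agent):
    """Return True if the user agent string contains any crawler marker.

    An empty or missing user agent is treated as non-crawler (assumption).
    """
    return bool(user_agent) and CRAWLER_PATTERN.search(user_agent) is not None


def count_views(user_agents):
    """Return (views_with_crawlers, views_without_crawlers).

    Mirrors the idea of keeping two sets of counts side by side.
    """
    total = 0
    without_crawlers = 0
    for ua in user_agents:
        total += 1
        if not is_crawler(ua):
            without_crawlers += 1
    return total, without_crawlers
```

Calling count_views on the user agent strings of one page's requests gives both counts at once, so the two can be published next to each other.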
BTW, crawler 'bots' are not to be confused with MediaWiki bots:
http://stats.wikimedia.org/EN/BotActivityMatrixEdits.htm
http://stats.wikimedia.org/EN/BotActivityMatrixCreates.htm
Erik Zachte
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Toby Negrin
Sent: Thursday, July 11, 2013 6:55 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Wikipedia Top 25
We have some bot information from wikistats here http://stats.wikimedia.org/#bots . I don't think it's particularly actionable for what you are doing, but it might be interesting directionally.
-Toby
On Wed, Jul 10, 2013 at 3:09 PM, Jeremy Baron jeremy@tuxmachine.com wrote:
On Wed, Jul 10, 2013 at 9:54 PM, Noneof MicrosoftsBusiness phonenumberofthebeast@hotmail.com wrote:
We've been working on tracking down the top 25 articles for each week, but as you can see
http://en.wikipedia.org/wiki/Wikipedia:5000
it requires determining which rankings are due to actual human views and which are due to bots, and recently, the bots have been having a field day.
I've been asked by the creator of the list to ask you for help and/or advice on how to use analytics to separate human from non-human views. Please let me know if there's anything that can be done.
I think at this point that would require either a change to the format of the Domas (anonymized) stats, or an NDA and maybe some other approvals. (Or Kraken! But rumor is that's not yet ready for the general public.)
-Jeremy
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics