Erik,
When trying to identify non-human traffic, it can also be helpful to
exclude any session that begins with (or contains) a request for
/robots.txt. Do you know if a mechanism exists that could flag such
sessions?
Even though many crawlers and spiders do not respect the contents of the
file, they almost always request it. This can help exclude non-human page
requests where the user agent string has been set to something that does
not contain a bot/crawler URL.
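As a rough illustration of what I mean (not an existing mechanism, just a sketch assuming logs already parsed into records with hypothetical 'session_id' and 'path' fields):

```python
from collections import defaultdict

def flag_robots_txt_sessions(records):
    """Flag sessions that ever requested /robots.txt as likely non-human.

    records: iterable of dicts with hypothetical 'session_id' and 'path' keys.
    Returns a dict mapping session_id -> True if the session fetched
    /robots.txt (so it can be excluded from human page-view counts).
    """
    sessions = defaultdict(list)
    for rec in records:
        sessions[rec["session_id"]].append(rec["path"])
    return {sid: "/robots.txt" in paths for sid, paths in sessions.items()}

records = [
    {"session_id": "a", "path": "/robots.txt"},
    {"session_id": "a", "path": "/wiki/Main_Page"},
    {"session_id": "b", "path": "/wiki/Main_Page"},
]
print(flag_robots_txt_sessions(records))  # {'a': True, 'b': False}
```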
--Michael
On Thu, Jul 11, 2013 at 4:26 PM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Jeremy,

Some background:

So we are talking about search engine crawlers here, right?

Here are the most active crawlers:
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm
For Google there is a special page with more depth:
http://stats.wikimedia.org/wikimedia/squids/SquidReportGoogle.htm

It's been a long-standing request to filter crawler traffic from page views.
We almost did it a year ago, and planned to have two sets of counts in
Domas' files (one with crawlers included, one without).
I'm not sure what got in the way. Diederik can tell you more about that
and the current status.

It would cut our page views by about 20%.
The test we planned to implement is pretty simple: check the 'user agent'
field for 'crawler', 'spider', 'bot', or 'http', and reject the request if
any of them occurs.
Note the user agent string is completely unregulated, but an informal rule
is to include a URL only in crawler requests.
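The planned test amounts to something like this (a sketch, not the actual implementation; the marker list is the one from the paragraph above):

```python
# Substring test on the user agent, as described above: reject the request
# if any of these markers occurs (case-insensitive). 'http' catches the
# informal convention of crawlers embedding a URL in their UA string.
CRAWLER_MARKERS = ("crawler", "spider", "bot", "http")

def is_crawler(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)

print(is_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)"))   # True
print(is_crawler("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"))   # False
```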

BTW crawler 'bots' are not to be confused with MediaWiki bots:
http://stats.wikimedia.org/EN/BotActivityMatrixEdits.htm
http://stats.wikimedia.org/EN/BotActivityMatrixCreates.htm

Erik Zachte

From: analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces(a)lists.wikimedia.org] On Behalf Of Toby Negrin
Sent: Thursday, July 11, 2013 6:55 AM
To: A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Wikipedia Top 25

We have some bot information from wikistats here:
http://stats.wikimedia.org/#bots
I don't think it's particularly actionable for what you are doing, but it
might be interesting directionally.

-Toby

On Wed, Jul 10, 2013 at 3:09 PM, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
On Wed, Jul 10, 2013 at 9:54 PM, Noneof MicrosoftsBusiness
<phonenumberofthebeast(a)hotmail.com> wrote:
We've been working on tracking down the top 25 articles for each week, but
as you can see at http://en.wikipedia.org/wiki/Wikipedia:5000 it requires
determining which rankings are due to actual human views and which are due
to bots, and recently, the bots have been having a field day.
I've been asked by the creator of the list to ask you for help and/or
advice on how to use analytics to separate human from non-human views.
Please let me know if there's anything that can be done.
I think at this point that would either require a change to the format
of the Domas (anonymized) stats or an NDA and maybe some other
approvals. (Or Kraken! But rumor is that's not yet ready for the
general public.)
-Jeremy
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics