Yes and no. So, we use a slightly more expanded version of the ua-parser bot filtering (for example, detecting automata - wget and Twisted PageGetter are not bots, but they should absolutely be filtered) and a slightly more expanded spider detection approach (there are Wikimedia-specific spiders). To me the greater risk is undeclared automata; I've had quite a lot of success detecting them using various concentration and density indexes, such as the Herfindahl index, computed over {ip, xff} tuples or user agents, but it requires >=1,000 pageviews to a particular URL to be useful.
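To make the concentration idea concrete, here is a minimal sketch in Python of the general approach: compute the Herfindahl-Hirschman index over {ip, xff} tuples per URL and flag URLs where a few requesters dominate the traffic. The input format, function names, and thresholds here are illustrative assumptions, not the actual pipeline.

```python
from collections import Counter, defaultdict

def herfindahl(counts):
    """Herfindahl-Hirschman index: sum of squared shares, ranging from 1/N to 1."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def flag_concentrated_urls(requests, min_pageviews=1000, hhi_threshold=0.25):
    """Flag URLs whose pageviews are concentrated in a few {ip, xff} tuples.

    `requests` is an iterable of (url, ip, xff) tuples. The threshold values
    are placeholders for illustration, not the ones used in practice.
    """
    per_url = defaultdict(Counter)
    for url, ip, xff in requests:
        per_url[url][(ip, xff)] += 1

    flagged = {}
    for url, requesters in per_url.items():
        total = sum(requesters.values())
        if total < min_pageviews:
            continue  # too few pageviews for the index to be meaningful
        hhi = herfindahl(requesters.values())
        if hhi >= hhi_threshold:
            flagged[url] = hhi
    return flagged
```

A high index for a URL with lots of traffic suggests a small number of clients (likely undeclared automata) rather than organic readership; the same calculation can be run over user agents instead of {ip, xff} tuples.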
So, there is more we can do - but it becomes complex and computationally intensive, and requires constant hand-coding to maintain. I have much sympathy for whoever in R&D has to absorb my work, because a lot of it is maintaining things like this, and pageviews are of limited utility for most purposes without this kind of filtering.
On 26 February 2015 at 02:31, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Erik Zachte, 25/02/2015 23:34:
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and
http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageB...
Ironholds' looks more vulnerable to bots; it's easiest to see on small wikis (though, kudos, many more small wikis are included than in wikistats). For instance, 20 more percentage points for USA on the Breton and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese bots they look similar, though in some cases I'm not sure what's going on: for instance, als.wiki also sees CH and RO emerge.
Will the new pageviews definition use the same bot filtering method?
Nemo