Yes and no. So, we use a slightly more expanded version of the ua-parser bot filtering (for example, detecting automata - wget and Twisted PageGetter are not bots, but they should absolutely be filtered) and a slightly more expanded spider detection approach (there are Wikimedia-specific spiders). To me the greater risk is undeclared automata; I've had quite a lot of success detecting them using various concentration and density indexes, such as the Herfindahl index, computed over {ip, xff} tuples or user agents, but it requires >=1,000 pageviews to a particular URL to be useful.
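To make the concentration idea concrete, here is a minimal sketch in Python of the general approach: compute the Herfindahl-Hirschman index over {ip, xff} tuples per URL and flag URLs where a few requesters dominate the traffic. The input format, function names, and thresholds here are illustrative assumptions, not the actual pipeline.

```python
from collections import Counter, defaultdict

def herfindahl(counts):
    """Herfindahl-Hirschman index: sum of squared shares, ranging from 1/N to 1."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def flag_concentrated_urls(requests, min_pageviews=1000, hhi_threshold=0.25):
    """Flag URLs whose pageviews are concentrated in a few {ip, xff} tuples.

    `requests` is an iterable of (url, ip, xff) tuples. The threshold values
    are placeholders for illustration, not the ones used in practice.
    """
    per_url = defaultdict(Counter)
    for url, ip, xff in requests:
        per_url[url][(ip, xff)] += 1

    flagged = {}
    for url, requesters in per_url.items():
        total = sum(requesters.values())
        if total < min_pageviews:
            continue  # too few pageviews for the index to be meaningful
        hhi = herfindahl(requesters.values())
        if hhi >= hhi_threshold:
            flagged[url] = hhi
    return flagged
```

A high index for a URL with lots of traffic suggests a small number of clients (likely undeclared automata) rather than organic readership; the same calculation can be run over user agents instead of {ip, xff} tuples.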
So, there is more we can do - but it becomes complex and computationally intensive, and requires constant hand-coding to maintain. I have much sympathy for whoever in R&D has to absorb my work, because a lot of it is maintaining things like this, and pageviews are of limited utility for most purposes without this kind of filtering.
On 26 February 2015 at 02:31, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Erik Zachte, 25/02/2015 23:34:
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and
http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageB...
Ironholds' looks more vulnerable to bots; it's easiest to see on small wikis (though, kudos, many more small wikis are included than in wikistats). For instance, 20 more percentage points for USA on the Breton and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese bots they look similar, though in some cases I'm not sure what's going on: for instance, als.wiki also sees CH and RO emerge.
Will the new pageviews definition use the same bot filtering method?
Nemo