Some of the rules used to identify automated traffic have been used by the
community for a couple of years now. See for example  and . For more
information, you can always ping us.
On Wed, May 13, 2020 at 7:44 AM Neil Shah-Quinn <nshahquinn(a)wikimedia.org> wrote:
Thank you for this update! I'm very excited about this new system.
I did notice that there's not much explanation of the particular rules or
strategies that are used to identify automated traffic, or a link to the
implementing code. I can imagine this might be intentional, to make it
harder for the spammers and vandals to evade the system. If so, it would be
helpful to update the page to say that explicitly and explain how people
can request more details if they have a legitimate need for them.
On Tue, 5 May 2020 at 02:40, Nuria Ruiz <nruiz(a)wikimedia.org> wrote:
We have added the 'automated' marker to Wikimedia's pageview data. Up to
now, pageview agents were classified as either 'spider' (self-reported bots
like 'google bot' or 'bing bot') or 'user'.
We have known for a while that some requests classified as 'user' were,
in fact, coming from automated agents not disclosed as such. This was a
well-known fact in our community: for a couple of years now, editors have
been applying filtering rules to any "Top X" list compiled. We have
incorporated some of these filters (and others) into our automated traffic
detection and, as of this week, traffic that meets the filtering
criteria is automatically excluded from being counted towards the "top"
lists reported by the pageview API.
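As a sketch of how the new breakdown can be consumed, assuming the public Pageviews REST API's aggregate endpoint and its agent path parameter (which, per this announcement, now distinguishes 'automated' alongside 'user' and 'spider'), one could build per-agent queries like this; the helper names and the example numbers are illustrative only:

```python
# Sketch: building Wikimedia Pageviews REST API aggregate URLs per agent type.
# The endpoint shape is an assumption based on the public REST API; the
# function names here are hypothetical helpers, not part of the API.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"

def aggregate_url(project, agent, start, end,
                  access="all-access", granularity="daily"):
    """Build an aggregate-pageviews URL for one agent type
    ('user', 'spider', 'automated', or 'all-agents')."""
    return f"{BASE}/{project}/{access}/{agent}/{granularity}/{start}/{end}"

def reduction_percent(user_views, automated_views):
    """Share of formerly 'user' traffic now reclassified as 'automated'."""
    return 100 * automated_views / (user_views + automated_views)

# Example: automated traffic on English Wikipedia for April 2020.
url = aggregate_url("en.wikipedia.org", "automated", "20200401", "20200430")
print(url)

# Illustrative numbers matching the ~5.6% overall reduction described above.
print(round(reduction_percent(94.4, 5.6), 1))
```

Comparing a 'user' query against an 'automated' one for the same project and month is how the per-project percentages mentioned below could be reproduced.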
The effect of removing pageviews marked as 'automated' from the overall
user traffic is about a 5.6% reduction in pageviews labeled as "user"
over the course of a month. Not all projects are affected equally when it
comes to the reduction of "user" pageviews. The biggest effect is on English
Wikipedia (8-10%), while projects like the Japanese Wikipedia are only
mildly affected (< 1%).
If you are curious about what problems this type of traffic causes in the
data, this ticket for Hungarian Wikipedia is a good example of the issues
inflicted by what we call "bot vandalism/bot spam":
Given the delicate nature of this data, we have worked for many months on
vetting the algorithms we are using. We would appreciate reports, via
Phabricator ticket, of any issues you might find.
Analytics mailing list