Hello:
We have added the 'automated' marker to Wikimedia's pageview data. Until now, pageview agents were classified as either 'spider' (self-reported bots such as 'google bot' or 'bing bot') or 'user'.
We have known for a while that some requests classified as 'user' were, in fact, coming from automated agents not disclosed as such. This was well known to our community, which for a couple of years now has been applying filtering rules to any "Top X" list it compiles [1]. We have incorporated some of these filters (and others) into our automated traffic detection and, as of this week, traffic that meets the filtering criteria is automatically excluded from the "top" lists reported by the pageview API.
Removing pageviews marked as 'automated' from the overall user traffic reduces the pageviews labeled as "user" by about 5.6% over the course of a month [2]. Not all projects are affected equally. The biggest effect is on English Wikipedia (8-10%), while projects like the Japanese Wikipedia are only mildly affected (< 1%).
If you are curious about the problems this type of traffic causes in the data, this ticket for Hungarian Wikipedia is a good example of the issues caused by what we call "bot vandalism/bot spam": https://phabricator.wikimedia.org/T237282
Given the delicate nature of this data, we have worked for many months on vetting the algorithms we are using. We would appreciate reports via Phabricator ticket of any issues you find.
Thanks,
Nuria
[1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection...
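As a rough illustration of the arithmetic behind the percentages in the announcement above: the reduction is the share of formerly 'user' pageviews that are now relabeled 'automated'. The function name and the monthly totals below are made up for the example; they are not figures from the Analytics team.

```python
def user_reduction_percent(user_before: int, automated: int) -> float:
    """Percent drop in 'user' pageviews once 'automated' views are split out.

    user_before: pageviews labeled 'user' under the old spider/user split
    automated:   the subset of those pageviews now relabeled 'automated'
    """
    if user_before <= 0:
        raise ValueError("user_before must be positive")
    if not 0 <= automated <= user_before:
        raise ValueError("automated must be between 0 and user_before")
    return 100 * automated / user_before


# Hypothetical monthly totals, chosen only to land near the ~5.6% quoted above.
print(user_reduction_percent(1_000_000, 56_000))
```

The same calculation applied per project would give the per-wiki figures, which is why a wiki with little undisclosed bot traffic shows a much smaller drop.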
Nuria,
Thank you for this update! I'm very excited about this new system.
I did notice that there's not much explanation of the particular rules or strategies that are used to identify automated traffic, or a link to the implementing code. I can imagine this might be intentional, to make it harder for the spammers and vandals to evade the system. If so, it would be helpful to update the page to say that explicitly and explain how people can request more details if they have a legitimate need for them.
On Tue, 5 May 2020 at 02:40, Nuria Ruiz nruiz@wikimedia.org wrote:
[...]
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
Neil:
Some of the rules used to identify automated traffic have been used by the community for a couple of years now. See, for example, [1] and [2]. For more information, you can always ping us.
Thanks,
Nuria
[1] https://tools.wmflabs.org/topviews/faq/#false_positive
[2] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
On Wed, May 13, 2020 at 7:44 AM Neil Shah-Quinn nshahquinn@wikimedia.org wrote:
[...]
Ah, nice!
I noticed that en:Main_Page traffic dropped by 40% as early as April 30, 5 days before Nuria's message. https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=a...
Just double-checking whether the drop is caused by the change in logging.
Thanks! Bob
On Mon, May 4, 2020 at 11:10 PM Nuria Ruiz nruiz@wikimedia.org wrote:
[...]
Robert: the pageview tool now also shows automated views, so you can check that it is indeed traffic detected as unreported bots:
https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=a...
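The same comparison can also be made against the pageview API directly. The sketch below only builds the request URLs; it assumes the per-article endpoint layout of the public Wikimedia REST API and the agent values `all-agents`, `user`, `spider` and `automated`, none of which are spelled out in this thread, so please check the API documentation before relying on it. The helper name `per_article_url` and the date range are made up for the example.

```python
from urllib.parse import quote

# Assumed root of the per-article pageviews endpoint (not stated in the thread).
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
# Assumed agent values; 'automated' is the new marker described in the announcement.
AGENTS = ("all-agents", "user", "spider", "automated")


def per_article_url(project, article, start, end,
                    agent="automated", access="all-access", granularity="daily"):
    """Build a per-article pageviews URL for one agent type.

    start and end are date strings in YYYYMMDD form.
    """
    if agent not in AGENTS:
        raise ValueError(f"unknown agent type: {agent!r}")
    # Percent-encode the title so characters such as '/' stay in one path segment.
    title = quote(article, safe="")
    return "/".join([API_ROOT, project, access, agent, title, granularity, start, end])


# Print one URL per agent type so the 'user' and 'automated' series can be compared.
for agent in ("user", "automated"):
    print(per_article_url("en.wikipedia.org", "Main_Page", "20200425", "20200505", agent=agent))
```

Fetching the two URLs and comparing the daily counts around April 30 would show whether the drop in 'user' views is matched by a rise in 'automated' views.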
On Thu, May 14, 2020 at 7:14 AM Robert West west@cs.stanford.edu wrote:
[...]
Ah, cool. Thanks a lot for pointing this out, Francisco! It's great that the automated views are separated out now.
Thanks! Bob
On Thu, May 14, 2020 at 7:19 AM Francisco Dans fdans@wikimedia.org wrote:
[...]
--
*Francisco Dans (él, he, 彼)*
Software Engineer, Analytics Team
stats.wikimedia.org
Wikimedia Foundation