Re: [Analytics] WikimediaBot convention

3 Feb 2016


Hi again analytics list,
Thank you all for your comments and feedback!
We consider this thread closed and will now proceed to:
1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy
   encouraging bot authors to (optionally) include the word "bot"
   (case-insensitive) in the User-Agent string, so that bots that generate
   pageviews not consumed onsite by humans can be easily identified by the
   Analytics cluster, thus increasing the accuracy of the human-vs-bot
   traffic split.
2. Advertise the convention and reach out to bot/framework maintainers
   to increase the share of bots that implement the User-Agent policy.
3. The Analytics team should implement regular expressions that match the
   current User-Agent policy, i.e. User-Agent strings containing emails, user
   pages, other MediaWiki URLs, GitHub URLs, and tools.wmflabs.org URLs.
   This will take some time and will probably raise technical issues, but it
   seems that we can benefit from it. https://phabricator.wikimedia.org/T125731
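
To make step 3 more concrete, here is a rough Python sketch of the kind of matching we have in mind. Every pattern and name below (BOT_SIGNALS, matched_signals, looks_like_bot) is an assumption made up for illustration; it is not the regular expression set used in the refinery code, and the real patterns will need tuning against actual traffic.

```python
import re
from typing import List

# Hypothetical patterns, one per signal named in the User-Agent policy.
# These are illustrative assumptions, not the Analytics team's actual regexes.
BOT_SIGNALS = {
    "bot_word": re.compile(r"bot", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "user_page": re.compile(r"\bUser:[^\s;)]+", re.IGNORECASE),
    "mediawiki_url": re.compile(
        r"https?://[\w.-]*(wikipedia|wikimedia|mediawiki)\.org", re.IGNORECASE
    ),
    "github_url": re.compile(r"github\.com/", re.IGNORECASE),
    "tools_url": re.compile(r"tools\.wmflabs\.org", re.IGNORECASE),
}


def matched_signals(user_agent: str) -> List[str]:
    """Return the names of all signals found in the User-Agent string."""
    return [name for name, pattern in BOT_SIGNALS.items() if pattern.search(user_agent)]


def looks_like_bot(user_agent: str) -> bool:
    """Tag a request as bot traffic if any signal matches."""
    return bool(matched_signals(user_agent))
```

For example, a made-up string such as "ExampleTool/1.0 (https://tools.wmflabs.org/example; someone@example.org)" would be tagged through its email and tools.wmflabs.org URL even though it never mentions "bot", while a plain browser User-Agent would match nothing.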
Cheers!
On Wed, Feb 3, 2016 at 11:43 PM, Marcel Ruiz Forns mforns@wikimedia.org
wrote:
...
John, thank you very much for taking the time to answer my question. My
responses inline (I rearranged some of your paragraphs to respond to them
together):
I think you need to clearly define what you want to capture and
classify, and re-evaluate what change to the user-agent policy will
have any noticeable impact on your detection accuracy in the next five
years.
&
...
If you do not want Googlebot to be grouped together with API-based
bots, either the user-agent needs to use something more distinctive,
such as 'MediaWikiBot', or you will need another regex of all the
'bot' matches which you don't want to be a bot.
&
...
The `analytics-refinery-source` code currently differentiates between
spider and bot, but earlier in this thread you said
  'I don't think we need to differentiate between "spiders" and "bots".'
If you require 'bot' in the user-agent for bots, this will also
capture Googlebot and YandexBot, and many other tools which use 'bot'.
Do you want Googlebot to be a bot?
But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured.
So you will still need a long regex for user-agents of tools which you
can't impose this change onto.
Differentiating between "spiders" and "bots" can be very tricky, as you
explain. There was some work on it in the past, but what we really want at
the moment is: to split the human vs bot traffic with a higher accuracy. I
will add that to the docs, thanks. Regarding measuring the impact: as we will
not be able to differentiate "spiders" from "bots", we can only observe how
the human vs bot traffic rates vary over time and try to associate those
variations with recent changes in User-Agent strings or regular expressions.
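
To illustrate the Googlebot / Yahoo! Slurp point, here is a minimal Python sketch. The spider list and the function names are invented for the example (they are not taken from analytics-refinery-source), and the sample User-Agent strings are the commonly published ones, quoted from memory.

```python
import re

# The optional convention: the word "bot", case-insensitive.
BOT_WORD = re.compile(r"bot", re.IGNORECASE)

# Hypothetical list of crawlers we cannot ask to change their User-Agent.
# A real list would be much longer.
KNOWN_SPIDERS = re.compile(r"Googlebot|YandexBot|Yahoo! Slurp", re.IGNORECASE)

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SLURP_UA = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"


def has_bot_word(user_agent: str) -> bool:
    """True if the User-Agent follows the 'bot' convention."""
    return bool(BOT_WORD.search(user_agent))


def is_automated(user_agent: str) -> bool:
    """Human-vs-bot split only: spiders and bots land in the same bucket."""
    return bool(KNOWN_SPIDERS.search(user_agent)) or has_bot_word(user_agent)
```

With these definitions, has_bot_word catches Googlebot but misses Yahoo! Slurp, which is why an explicit list is still needed next to the convention; is_automated puts both in the non-human bucket without trying to tell "spiders" from "bots".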
The eventual definition of 'bot' will be very central to this issue.
...
Which tools need to start adding 'bot'?  What is 'human' use?  This
terminology problem has caused much debate on the wikis, reaching
arbcom several times.  So, precision in the definition will be quite
helpful.
Agree, will add that to the proposal.
One of the strange areas to consider is jquery-based tools that are
effectively bots, performing large numbers of operations on pages in
batches with only high-level commands being given by a human.  e.g.
the gadget Cat-a-Lot.  If those are not a 'bot', then many pywikibot
scripts are also not a 'bot'.
I think the key here is: the program should be tagged as a bot by
analytics, if it generates pageviews not consumed onsite by a human. I will
mention that in the docs, too. Thanks.
...
If gadgets and user-scripts may need to follow the new 'bot' rule of
the user-agent policy, the number of developers that need to be
engaged is much larger.
&
...
Please understand the gravity of what you are imposing.  Changing a
user-agent of a client is a breaking change, and any decent MediaWiki
client is also used by non-Wikimedia wikis, administrated by
non-Wikimedia ops teams, who may have their own tools doing analysis
of user-agents hitting their servers, possibly including access
control rules.  And their rules and scripts may break when a client
framework changes its user-agent in order to make the Wikimedia
Analytics scripts easier.  Strictly speaking your user-agent policy
proposal requires a new _major_ release for every client framework
that you do not grandfather into your proposed user-agent policy.
&
...
If you are only updating the policy to "encourage" the use of the
'bot' in the user-agent, there will not be any concerns as this is
quite common anyway, and it is optional. ;-)
The dispute will occur when the addition of 'bot' becomes mandatory.
I see your point. The addition of "bot" will be optional (as is the rest
of the policy); we will make that clear in the docs.
If the proposal is to require only 'bot' in the user-agent,
pywikipediabot and pywikibot both need no change to add it (yay!, but
do we need to add 'human' to the user-agent for some scripts??), but
many client frameworks will still need to change their user-agent,
including for example both of the Go frameworks.
https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
&
...
By doing some analysis of the existing user-agents hitting your
servers, maybe you can find an easy way to grandfather in most client
frameworks.   e.g. if you also add 'github' as a bot pattern, both Go
frameworks are automatically now also supported.
&
...
[[w:User_agent]] says:
"Bots, such as Web crawlers, often also include a URL and/or e-mail
address so that the Webmaster can contact the operator of the bot."
So including URL/email as part of your detection should capture most
well written bots.
Also including any requests from tools.wmflabs.org and friends as
'bot' might also be a useful improvement.
That is a very good insight. Thanks. Currently, the User-Agent policy is
not implemented in our regular expressions, meaning that they do not match
emails, user pages, or other MediaWiki URLs. They could also, as you
suggest, match GitHub accounts or tools.wmflabs.org. We in Analytics
should tackle that. I will create a task for that and add it to
the proposal.
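
As a rough idea of the analysis John suggests, the Python sketch below counts how many requests in a sample would be picked up by each extra pattern. The sample data, the pattern set and the coverage function are all hypothetical; real numbers would have to come from the actual request logs.

```python
import re
from collections import Counter

# Hypothetical candidate patterns for grandfathering existing clients.
PATTERNS = {
    "bot": re.compile(r"bot", re.IGNORECASE),
    "github": re.compile(r"github", re.IGNORECASE),
    "tools.wmflabs.org": re.compile(r"tools\.wmflabs\.org", re.IGNORECASE),
}

# Made-up (user_agent, request_count) pairs, for illustration only.
SAMPLE = [
    ("pywikibot/2.0 (hypothetical)", 50),
    ("go-client (https://github.com/example/client)", 30),
    ("SomeTool (https://tools.wmflabs.org/sometool)", 15),
    ("Mozilla/5.0 (X11; Linux x86_64)", 5),
]


def coverage(sample):
    """Sum request counts per matching pattern; unmatched requests go to 'unmatched'."""
    counts = Counter()
    for user_agent, requests in sample:
        hits = [name for name, pattern in PATTERNS.items() if pattern.search(user_agent)]
        for name in hits:
            counts[name] += requests
        if not hits:
            counts["unmatched"] += requests
    return counts
```

Running coverage(SAMPLE) on this toy data attributes 50 requests to "bot", 30 to "github", 15 to "tools.wmflabs.org" and leaves 5 unmatched, which is the kind of breakdown that would tell us how many clients each added pattern grandfathers in.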
Thanks again. I'll send the proposal with the changes shortly.
On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg jayvdb@gmail.com
wrote:
...
On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns mforns@wikimedia.org
wrote:
...
Hi all,
It seems comments are decreasing at this point. I'd like to slowly drive
this thread to a conclusion.
...

Create a plan to block clients that don't implement the (amended)
User-Agent policy.
I think we can decide on this later. Steps 1) and 2) can be done first -
they should be done anyway before 3) - and then we can see how much
benefit we get from them. If we don't get a satisfactory reaction from
bot/framework maintainers, we can then go for 3). John, would you be OK
with that?
I think you need to clearly define what you want to capture and
classify, and re-evaluate what change to the user-agent policy will
have any noticeable impact on your detection accuracy in the next five
years.
The eventual definition of 'bot' will be very central to this issue.
Which tools need to start adding 'bot'?  What is 'human' use?  This
terminology problem has caused much debate on the wikis, reaching
arbcom several times.  So, precision in the definition will be quite
helpful.
One of the strange areas to consider is jquery-based tools that are
effectively bots, performing large numbers of operations on pages in
batches with only high-level commands being given by a human.  e.g.
the gadget Cat-a-Lot.  If those are not a 'bot', then many pywikibot
scripts are also not a 'bot'.
If gadgets and user-scripts may need to follow the new 'bot' rule of
the user-agent policy, the number of developers that need to be
engaged is much larger.
If the proposal is to require only 'bot' in the user-agent,
pywikipediabot and pywikibot both need no change to add it (yay!, but
do we need to add 'human' to the user-agent for some scripts??), but
many client frameworks will still need to change their user-agent,
including for example both of the Go frameworks.
https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
By doing some analysis of the existing user-agents hitting your
servers, maybe you can find an easy way to grandfather in most client
frameworks.   e.g. if you also add 'github' as a bot pattern, both Go
frameworks are automatically now also supported.
Please understand the gravity of what you are imposing.  Changing a
user-agent of a client is a breaking change, and any decent MediaWiki
client is also used by non-Wikimedia wikis, administrated by
non-Wikimedia ops teams, who may have their own tools doing analysis
of user-agents hitting their servers, possibly including access
control rules.  And their rules and scripts may break when a client
framework changes its user-agent in order to make the Wikimedia
Analytics scripts easier.  Strictly speaking your user-agent policy
proposal requires a new _major_ release for every client framework
that you do not grandfather into your proposed user-agent policy.
Poorly written/single-purpose/once-off clients are less of a problem,
as forcing change on them has lower impact.
[[w:User_agent]] says:
"Bots, such as Web crawlers, often also include a URL and/or e-mail
address so that the Webmaster can contact the operator of the bot."
So including URL/email as part of your detection should capture most
well written bots.
Also including any requests from tools.wmflabs.org and friends as
'bot' might also be a useful improvement.
The `analytics-refinery-source` code currently differentiates between
spider and bot, but earlier in this thread you said
'I don't think we need to differentiate between "spiders" and "bots".'
If you require 'bot' in the user-agent for bots, this will also
capture Googlebot and YandexBot, and many other tools which use 'bot'.
Do you want Googlebot to be a bot?
But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured.
So you will still need a long regex for user-agents of tools which you
can't impose this change onto.
If you do not want Googlebot to be grouped together with API-based
bots, either the user-agent needs to use something more distinctive,
such as 'MediaWikiBot', or you will need another regex of all the
'bot' matches which you don't want to be a bot.
...
If no-one else raises concerns about this, the Analytics team will:
Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to
encourage including the word "bot" (case-insensitive) in the User-Agent
string, so that bots can be easily identified.
If you are only updating the policy to "encourage" the use of the
'bot' in the user-agent, there will not be any concerns as this is
quite common anyway, and it is optional. ;-)
The dispute will occur when the addition of 'bot' becomes mandatory.
--
John Vandenberg

Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
