John, thank you very much for taking the time to answer my question. My responses are inline (I rearranged some of your paragraphs to respond to them together):

I think you need to clearly define what you want to capture and
classify, and re-evaluate what change to the user-agent policy will
have any noticeable impact on your detection accuracy in the next five
years.

If you do not want Googlebot to be grouped together with API-based
bots, either the user-agent needs to use something more distinctive,
such as 'MediaWikiBot', or you will need another regex of all the
'bot' matches which you don't want to be a bot.

The `analytics-refinery-source` code currently differentiates between
spider and bot, but earlier in this thread you said
'I don't think we need to differentiate between "spiders" and "bots".'
If you require 'bot' in the user-agent for bots, this will also
capture Googlebot and YandexBot, and many other tools which use
'bot'. Do you want Googlebot to be a bot?
But Yahoo! Slurp, whose user-agent doesn't include 'bot', will not
be captured.
So you will still need a long regex for user-agents of tools which you
can't impose this change onto.

Differentiating between "spiders" and "bots" can be very tricky, as you explain. There was some work on it in the past, but what we really want at the moment is to split human vs bot traffic with higher accuracy. I will add that to the docs, thanks. Regarding measuring the impact: as we will not be able to differentiate "spiders" from "bots", we can only observe how the human vs bot traffic rates vary over time and try to associate those variations with recent changes in User-Agent strings or regular expressions.
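To make the Googlebot vs Slurp point concrete, here is a minimal Python sketch. It is not the `analytics-refinery-source` code; the names `BOT_MARKER`, `LEGACY_AUTOMATA`, `has_bot_marker` and `is_automated`, as well as the contents of the supplementary pattern, are assumptions made up for illustration.

```python
import re

# Assumption: the policy marker is the substring "bot", matched case-insensitively.
BOT_MARKER = re.compile(r"bot", re.IGNORECASE)

# Hypothetical supplementary pattern for agents that cannot be asked to
# change their User-Agent. A real list would be much longer.
LEGACY_AUTOMATA = re.compile(r"slurp|spider|crawler", re.IGNORECASE)

# Sample User-Agent strings, used only to exercise the two patterns.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
YANDEXBOT = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
SLURP = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
FIREFOX = "Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0"


def has_bot_marker(user_agent: str) -> bool:
    """True if the User-Agent contains 'bot' in any letter case."""
    return BOT_MARKER.search(user_agent) is not None


def is_automated(user_agent: str) -> bool:
    """True if either the 'bot' marker or the supplementary pattern matches."""
    return has_bot_marker(user_agent) or LEGACY_AUTOMATA.search(user_agent) is not None
```

With these samples, `has_bot_marker` matches Googlebot and YandexBot but misses Yahoo! Slurp, so only the supplementary pattern puts Slurp on the bot side of the human vs bot split. No attempt is made to separate "spiders" from "bots".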

The eventual definition of 'bot' will be very central to this issue.
Which tools need to start adding 'bot'? What is 'human' use? This
terminology problem has caused much debate on the wikis, reaching
arbcom several times. So, precision in the definition will be quite
helpful.

Agree, will add that to the proposal.

One of the strange areas to consider is jQuery-based tools that are
effectively bots, performing large numbers of operations on pages in
batches with only high-level commands being given by a human. e.g.
the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot
scripts are also not a 'bot'.

I think the key here is: a program should be tagged as a bot by Analytics if it generates pageviews that are not consumed onsite by a human. I will mention that in the docs, too. Thanks.

If gadgets and user-scripts may need to follow the new 'bot' rule of
the user-agent policy, the number of developers that need to be
engaged is much larger.

Please understand the gravity of what you are imposing. Changing a
user-agent of a client is a breaking change, and any decent MediaWiki
client is also used by non-Wikimedia wikis, administrated by
non-Wikimedia ops teams, who may have their own tools doing analysis
of user-agents hitting their servers, possibly including access
control rules. And their rules and scripts may break when a client
framework changes its user-agent in order to make the Wikimedia
Analytics scripts easier. Strictly speaking your user-agent policy
proposal requires a new _major_ release for every client framework
that you do not grandfather into your proposed user-agent policy.

If you are only updating the policy to "encourage" the use of the
'bot' in the user-agent, there will not be any concerns as this is
quite common anyway, and it is optional. ;-)
The dispute will occur when the addition of 'bot' becomes mandatory.

I see your point. The addition of "bot" will be optional (as is the rest of the policy); we will make that clear in the docs.

If the proposal is to require only 'bot' in the user-agent,
pywikipediabot and pywikibot both need no change to add it (yay!, but
do we need to add 'human' to the user-agent for some scripts??), but
many client frameworks will still need to change their user-agent,
including for example both of the Go frameworks.
https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21

By doing some analysis of the existing user-agents hitting your
servers, maybe you can find an easy way to grandfather in most client
frameworks. e.g. if you also add 'github' as a bot pattern, both Go
frameworks are automatically now also supported.

[[w:User_agent]] says:
"Bots, such as Web crawlers, often also include a URL and/or e-mail
address so that the Webmaster can contact the operator of the bot."
So including URL/email as part of your detection should capture most
well written bots.
Also including any requests from tools.wmflabs.org and friends as
'bot' might also be a useful improvement.

That is a very good insight, thanks. Currently, the User-Agent policy is not implemented in our regular expressions: they do not match emails, user pages, or other MediaWiki URLs. They could also, as you suggest, match GitHub accounts or tools.wmflabs.org. We in Analytics should tackle that. I will create a task for it and add it to the proposal.
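As a rough idea of what such a task could cover, here is a minimal Python sketch that looks for contact information inside a User-Agent string. Everything in it is an assumption for illustration: the `CONTACT_PATTERNS` table, the `contact_hints` helper, the deliberately loose regular expressions, and the sample strings are hypothetical. It also only inspects the User-Agent string itself, so it would not cover requests that merely originate from tools.wmflabs.org.

```python
import re

# Hypothetical, deliberately loose patterns; a production version would need
# tuning against real traffic to keep false positives down.
CONTACT_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+", re.IGNORECASE),
    "github": re.compile(r"github", re.IGNORECASE),
    "wmflabs": re.compile(r"tools\.wmflabs\.org", re.IGNORECASE),
    "user_page": re.compile(r"\bUser:\S+", re.IGNORECASE),
}


def contact_hints(user_agent: str) -> list:
    """Return the names of all contact patterns found in the User-Agent."""
    return [name for name, pattern in CONTACT_PATTERNS.items()
            if pattern.search(user_agent)]


# Made-up User-Agent strings, not taken from any real client.
EXAMPLE_CLIENT = "ExampleClient/1.0 (https://github.com/example/client; ops@example.org)"
EXAMPLE_TOOL = "MyTool/0.1 (tools.wmflabs.org/mytool; User:Example)"
EXAMPLE_BROWSER = "Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0"
```

A request could then be counted as bot traffic when `contact_hints` returns a non-empty list, in addition to the existing regular expressions. Whether that rule is precise enough is something the task would have to check.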

Thanks again. I'll send the proposal with the changes shortly.

On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg <jayvdb@gmail.com> wrote:

On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
> Hi all,
>
> It seems comments are decreasing at this point. I'd like to slowly drive
> this thread to a conclusion.
>
>
>> 3. Create a plan to block clients that don't implement the (amended)
>> User-Agent policy.
>
>
> I think we can decide on this later. Steps 1) and 2) can be done first -
> they should be done anyway before 3) - and then we can see how much benefit
> we gain from them. If we don't get a satisfactory reaction from
> bot/framework maintainers, we then can go for 3). John, would you be OK with
> that?

I think you need to clearly define what you want to capture and
classify, and re-evaluate what change to the user-agent policy will
have any noticeable impact on your detection accuracy in the next five
years.

The eventual definition of 'bot' will be very central to this issue.
Which tools need to start adding 'bot'? What is 'human' use? This
terminology problem has caused much debate on the wikis, reaching
arbcom several times. So, precision in the definition will be quite
helpful.

One of the strange areas to consider is jQuery-based tools that are
effectively bots, performing large numbers of operations on pages in
batches with only high-level commands being given by a human. e.g.
the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot
scripts are also not a 'bot'.

If gadgets and user-scripts may need to follow the new 'bot' rule of
the user-agent policy, the number of developers that need to be
engaged is much larger.

If the proposal is to require only 'bot' in the user-agent,
pywikipediabot and pywikibot both need no change to add it (yay!, but
do we need to add 'human' to the user-agent for some scripts??), but
many client frameworks will still need to change their user-agent,
including for example both of the Go frameworks.
https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21

By doing some analysis of the existing user-agents hitting your
servers, maybe you can find an easy way to grandfather in most client
frameworks. e.g. if you also add 'github' as a bot pattern, both Go
frameworks are automatically now also supported.

Please understand the gravity of what you are imposing. Changing a
user-agent of a client is a breaking change, and any decent MediaWiki
client is also used by non-Wikimedia wikis, administrated by
non-Wikimedia ops teams, who may have their own tools doing analysis
of user-agents hitting their servers, possibly including access
control rules. And their rules and scripts may break when a client
framework changes its user-agent in order to make the Wikimedia
Analytics scripts easier. Strictly speaking your user-agent policy
proposal requires a new _major_ release for every client framework
that you do not grandfather into your proposed user-agent policy.

Poorly written/single-purpose/once-off clients are less of a problem,
as forcing change on them has lower impact.

[[w:User_agent]] says:

"Bots, such as Web crawlers, often also include a URL and/or e-mail
address so that the Webmaster can contact the operator of the bot."

So including URL/email as part of your detection should capture most
well written bots.
Also including any requests from tools.wmflabs.org and friends as
'bot' might also be a useful improvement.

The `analytics-refinery-source` code currently differentiates between
spider and bot, but earlier in this thread you said

'I don't think we need to differentiate between "spiders" and "bots".'

If you require 'bot' in the user-agent for bots, this will also
capture Googlebot and YandexBot, and many other tools which use
'bot'. Do you want Googlebot to be a bot?

But Yahoo! Slurp, whose user-agent doesn't include 'bot', will not
be captured.

So you will still need a long regex for user-agents of tools which you
can't impose this change onto.

If you do not want Googlebot to be grouped together with API-based
bots, either the user-agent needs to use something more distinctive,
such as 'MediaWikiBot', or you will need another regex of all the
'bot' matches which you don't want to be a bot.

> If no-one else raises concerns about this, the Analytics team will:
>
> Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to
> encourage including the word "bot" (case-insensitive) in the User-Agent
> string, so that bots can be easily identified.

If you are only updating the policy to "encourage" the use of the
'bot' in the user-agent, there will not be any concerns as this is
quite common anyway, and it is optional. ;-)

The dispute will occur when the addition of 'bot' becomes mandatory.

--
John Vandenberg

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

Marcel Ruiz Forns

Analytics Developer

Wikimedia Foundation