John, thank you very much for taking the time to answer my question. My responses are inline (I rearranged some of your paragraphs to respond to them together):
I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
&
If you do not want Googlebot to be grouped together with API-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
&
The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said 'I don't think we need to differentiate between "spiders" and "bots".' If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot? But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured. So you will still need a long regex for user-agents of tools which you can't impose this change onto.
Differentiating between "spiders" and "bots" can be very tricky, as you explain. There was some work on it in the past, but what we really want at the moment is to split human vs. bot traffic with higher accuracy. I will add that to the docs, thanks. Regarding measuring the impact: as we won't be able to differentiate "spiders" from "bots", we can only observe how the human vs. bot traffic rates vary over time and try to associate those variations with recent changes in User-Agent strings or regular expressions.
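To illustrate your point about a plain 'bot' match, here is a minimal sketch (approximate sample strings and a bare substring pattern, not our actual analytics-refinery-source regexes):

    import re

    # A bare, case-insensitive 'bot' match, as the amended policy would rely on.
    BOT_PATTERN = re.compile(r'bot', re.IGNORECASE)

    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
        "pywikibot/2.0 (User:ExampleUser; example@example.org)",
    ]

    for ua in samples:
        print(bool(BOT_PATTERN.search(ua)), ua)

    # Googlebot, YandexBot and pywikibot all match; Yahoo! Slurp does not,
    # so a separate crawler regex would still be needed either way.

So yes: a plain 'bot' token groups search-engine crawlers together with API clients, and we would still maintain an explicit crawler list on top of it.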
The eventual definition of 'bot' will be very central to this issue.
Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
Agree, will add that to the proposal.
One of the strange areas to consider is jQuery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
I think the key here is: the program should be tagged as a bot by analytics if it generates pageviews that are not consumed onsite by a human. I will mention that in the docs, too. Thanks.
If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
&
Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
&
If you are only updating the policy to "encourage" the use of the 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-) The dispute will occur when the addition of 'bot' becomes mandatory.
I see your point. The addition of "bot" will be optional (as is the rest of the policy); we will make that clear in the docs.
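For illustration only, a client following the encouraged (optional) convention might send a User-Agent like the one in this sketch; the tool name, URL and email are placeholders I made up, nothing the policy would mandate:

    import requests

    # Hypothetical User-Agent following the encouraged convention: a 'bot' token
    # plus contact details so the operator can be reached.
    headers = {
        "User-Agent": "ExampleWikiBot/1.0 (https://example.org/wikibot; wikibot@example.org)"
    }

    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "meta": "siteinfo", "format": "json"},
        headers=headers,
    )
    print(response.status_code)

Again, this is just to show what the encouraged format could look like; nothing in it would become mandatory.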
If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks: https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
&
By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
&
[[w:User_agent]] says: "Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot." So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
That is a very good insight, thanks. Currently, the User-Agent policy is not implemented in our regular expressions, meaning they do not match emails, user pages, or other MediaWiki URLs. They could also, as you suggest, match GitHub accounts or tools.wmflabs.org. We in Analytics should tackle that. I will create a task for it and add it to the proposal.
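Just to sketch one possible shape for that task (the patterns below are assumptions for illustration, not what analytics-refinery-source does today):

    import re

    # Assumed signals of automated traffic: an explicit 'bot' token, a contact
    # URL or e-mail address, a Toolforge host, or a link to a GitHub repository.
    EXTENDED_BOT_PATTERN = re.compile(
        r"bot"
        r"|https?://"
        r"|[\w.+-]+@[\w-]+\.[\w.-]+"
        r"|tools\.wmflabs\.org"
        r"|github\.com",
        re.IGNORECASE,
    )

    def looks_automated(user_agent: str) -> bool:
        """Return True if the user-agent matches any of the assumed bot signals."""
        return bool(EXTENDED_BOT_PATTERN.search(user_agent))

    print(looks_automated("go-mwclient (https://github.com/cgt/go-mwclient)"))       # True
    print(looks_automated("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/44.0"))  # False

The exact patterns would of course come out of that task and the analysis of real traffic.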
Thanks again; in short, I'll send the updated proposal with these changes.
On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg jayvdb@gmail.com wrote:
On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Hi all,
It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
- Create a plan to block clients that don't implement the (amended) User-Agent policy.
I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we gain from them. If we don't get a satisfactory reaction from bot/framework maintainers, we can then go for 3). John, would you be OK with that?
I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
One of the strange areas to consider is jQuery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks. https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
Poorly written/single-purpose/once-off clients are less of a problem, as forcing change on them has lower impact.
[[w:User_agent]] says:
"Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said
'I don't think we need to differentiate between "spiders" and "bots".'
If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot?
But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured.
So you will still need a long regex for user-agents of tools which you can't impose this change onto.
If you do not want Googlebot to be grouped together with API-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
If no-one else raises concerns about this, the Analytics team will:
Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
If you are only updating the policy to "encourage" the use of the 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-)
The dispute will occur when the addition of 'bot' becomes mandatory.
-- John Vandenberg