On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Hi all,
It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
- Create a plan to block clients that don't implement the (amended)
User-Agent policy.
I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we gain from them. If we don't get a satisfactory reaction from bot/framework maintainers, we can then go for 3). John, would you be OK with that?
I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
One of the strange areas to consider is jQuery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
If the proposal is to require only 'bot' in the user-agent, neither pywikipediabot nor pywikibot needs any change to add it (yay! - but do we need to add 'human' to the user-agent for some scripts?), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks. https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163 https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
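For a framework that does have to change, a compliant user-agent is a small fix. A minimal sketch, assuming the format recommended on the meta User-Agent policy page (tool name/version plus contact details); the tool name, URL, and e-mail address here are placeholders, not real projects:

```python
import urllib.request

# Hypothetical example of a user-agent that would satisfy the proposed
# policy: it contains 'bot' (case-insensitively) and contact details.
# 'ExampleSyncBot' and the contact info are invented for illustration.
USER_AGENT = (
    "ExampleSyncBot/1.0 "
    "(https://example.org/examplesyncbot; operator@example.org)"
)

# Attach it to all requests made through this opener.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", USER_AGENT)]

# The proposed (case-insensitive) check the Analytics side would apply.
print("bot" in USER_AGENT.lower())
```

The point being that for the frameworks themselves the change is trivial; the cost is in the downstream breakage described below, not in the code.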
By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically supported.
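The grandfathering idea can be sketched in a few lines. This is only an illustration of the approach, not the Analytics team's actual rules; the sample user-agents are made up:

```python
import re

# Treat 'bot' (case-insensitive) as the primary marker, but also accept
# extra grandfathered patterns such as 'github', so existing frameworks
# that link to their repository match without changing their user-agent.
BOT_PATTERNS = re.compile(r"bot|github", re.IGNORECASE)

samples = {
    "pywikibot/2.0": True,                                   # contains 'bot'
    "go-mwclient (https://github.com/cgt/go-mwclient)": True,  # 'github'
    "Mozilla/5.0 (Windows NT 10.0) Firefox/44.0": False,     # a browser
}

for ua, expected in samples.items():
    assert bool(BOT_PATTERNS.search(ua)) is expected
```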
Please understand the gravity of what you are imposing. Changing a client's user-agent is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. Their rules and scripts may break when a client framework changes its user-agent in order to make life easier for the Wikimedia Analytics scripts. Strictly speaking, your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into the proposed policy.
Poorly written/single-purpose/one-off clients are less of a problem, as forcing change on them has lower impact.
[[w:User_agent]] says:
"Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
So including URL/email as part of your detection should capture most well-written bots. Treating any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
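A rough heuristic along those lines, per the [[w:User_agent]] quote above. The regexes are illustrative guesses, not anyone's production rules:

```python
import re

# Well-behaved bots often embed a URL and/or e-mail address in their
# user-agent so the webmaster can contact the operator. These patterns
# are deliberately loose sketches of that idea.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def looks_like_bot(ua):
    """Guess 'bot' from the presence of contact details in the UA."""
    return bool(URL_RE.search(ua) or EMAIL_RE.search(ua))

print(looks_like_bot("MyTool/1.0 (https://example.org; me@example.org)"))  # True
print(looks_like_bot("Mozilla/5.0 (X11; Linux x86_64) Firefox/44.0"))      # False
```

Note this also catches browser-like crawlers that include a URL, which feeds into the Googlebot question below.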
The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said
'I don't think we need to differentiate between "spiders" and "bots".'
If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot?
But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured.
So you will still need a long regex for user-agents of tools which you can't impose this change onto.
If you do not want Googlebot to be grouped together with API-based bots, either the user-agents need to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to count as bots.
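The exclusion-list option looks like this in outline. The crawler names here are just the examples from this thread, not a complete list, and the category names mirror the spider/bot split mentioned above:

```python
import re

# Sketch of option two: keep matching 'bot', but maintain a separate
# exclusion regex for search-engine crawlers you do not want to count
# as API bots. This list would need ongoing maintenance.
CRAWLER_EXCLUDES = re.compile(r"googlebot|yandexbot", re.IGNORECASE)
BOT_MARKER = re.compile(r"bot", re.IGNORECASE)

def classify(ua):
    if CRAWLER_EXCLUDES.search(ua):
        return "spider"
    if BOT_MARKER.search(ua):
        return "bot"
    return "user"

print(classify("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # spider
print(classify("pywikibot/2.0"))                                    # bot
print(classify("Mozilla/5.0 Firefox/44.0"))                         # user
```

The alternative (requiring a distinctive marker like 'MediaWikiBot') avoids the maintenance burden of the exclusion list, but forces the breaking change on every client framework, as discussed above.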
If no-one else raises concerns about this, the Analytics team will:
Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
If you are only updating the policy to "encourage" the use of the 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-)
The dispute will occur when the addition of 'bot' becomes mandatory.
-- John Vandenberg