Hi analytics list,
In the past months the WikimediaBot convention has been mentioned in a couple of threads, but we (the Analytics team) never finished establishing and advertising it. In this email we explain what the convention is today and what purpose it serves, and we also ask for feedback to make sure we can continue with the next steps.
What is the WikimediaBot convention? It is a way of better identifying Wikimedia traffic originating from bots. Today we know that a significant share of Wikimedia traffic comes from bots. We can recognize part of that traffic with regular expressions[1], but we cannot recognize all of it, because some bots do not identify themselves as such. If we could identify a greater part of the bot traffic, we could also better isolate the human traffic and permit more accurate analyses.
Who should follow the convention? Computer programs that access Wikimedia sites or the Wikimedia API for reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention? Computer programs that follow the on-site, ad-hoc commands of a human, like browsers; and well-known spiders that are already recognizable by their user-agent strings.
How to follow the convention? The client's user-agent string should contain the word "WikimediaBot". The word can be anywhere within the user-agent string and is case-sensitive.
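As an illustration only (not part of the proposal): a minimal Python sketch of a scheduled read-only client that would follow the convention. The script name, contact URL and use of the requests library are invented example choices.

    import requests

    # The only requirement of the convention: the word "WikimediaBot"
    # (case-sensitive) appears somewhere in the user-agent string. The
    # client name and contact URL below are made-up examples in the spirit
    # of the general User-Agent policy.
    HEADERS = {
        "User-Agent": "ExampleReader/0.1 (https://example.org/contact) WikimediaBot"
    }

    def fetch_page_info(title):
        # Read-only API request, e.g. run periodically from a cron job.
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "titles": title, "format": "json"},
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(fetch_page_info("Wikipedia"))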
So, please, feel free to post your comments/feedback on this thread. In the course of this discussion we can adjust the convention's definition and, if no major concerns are raised, in 2 weeks we'll create a documentation page on Wikitech, send an email to the proper mailing lists and maybe write a blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia content.
[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-... [2] https://www.mediawiki.org/wiki/Manual:Bots
On Wed, Jan 27, 2016 at 5:15 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
The number one User-Agent for hits to api.php is "-", despite [[:meta:User-Agent_policy]] [3]. I think we can do a bit of outreach and get some bot maintainers and maybe a framework or two to implement this. I would be very surprised if it went much further than that unless we actually implement some sort of enforcement policy. That being said, I completely understand that maintaining a regex or other matching system isn't scalable at all, and this convention seems like a reasonable way to start chipping away at the problem.
[3]: https://meta.wikimedia.org/wiki/User-Agent_policy
On Wed, Jan 27, 2016 at 5:15 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
(*) There is already another convention[2] for bots that EDIT Wikimedia content. [2] https://www.mediawiki.org/wiki/Manual:Bots
I don't see anything described in [[Manual:Bots]] that would actually help you detect bot editing traffic. What am I missing? Bots (or any high-volume editing user) are typically asked to get specific rights on the wikis that they are interacting with, but those on-wiki permissions are associated with the wiki account and not the user agent.
Bryan
I wonder if Marcel means "crawlers".
Yeah, I don't see anything at [[Manual:Bots]] that mentions a user-agent convention for bots that edit Wikipedia. As far as I know, there isn't any. Most editing bots either use a user-agent string set by the framework they are using or have a completely unique user-agent string. It seems like it would also be a good idea to create a convention for bots that edit.
On Thu, Jan 28, 2016 at 11:15 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
This is useless unless someone is going to start blocking bots that don't follow it.
There is an existing policy, which is not being followed / enforced.
https://meta.wikimedia.org/wiki/User-Agent_policy
It is also extremely annoying that clients (e.g. Pywikibot) now need to add a Wikimedia-specific tag to their user-agent. A user-agent should be client specific, not server specific. Why not just "Bot", or "MediaWikiBot", which at least encompasses all sites that the client can communicate with?
A user-agent should be client specific, not server specific.
This makes a lot of sense. If I build a bot that crawls Wikipedia and Facebook public pages, it really doesn't make sense for my bot to have a "WikimediaBot" user agent; just the word "Bot" should probably be enough.
On Thu, Jan 28, 2016 at 11:47 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Anything with "bot" (case-insensitive) in the UA is already caught by the "spiderPattern" regex [0]. The rest of the new logic in Webrequest related to this feature seems to take that into account by making isSpider() check that wikimediaBotPattern is not matched. I guess a bigger question is why try to differentiate between "spiders" and "bots" at all?
[0]: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-...
Bryan
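(To illustrate the check Bryan describes: a rough Python approximation of that logic. The actual implementation is the Java code linked as [0]; the pattern contents below are simplified stand-ins, not the real refinery regexes.)

    import re

    # Simplified stand-ins for the patterns in refinery-source [0].
    SPIDER_PATTERN = re.compile(r"bot|spider|crawler|http", re.IGNORECASE)
    WIKIMEDIA_BOT_PATTERN = re.compile(r"WikimediaBot")  # case-sensitive, per the convention

    def is_wikimedia_bot(user_agent):
        return bool(WIKIMEDIA_BOT_PATTERN.search(user_agent))

    def is_spider(user_agent):
        # As described above: counts as "spider" when the generic pattern
        # matches but the UA is not a self-identified WikimediaBot.
        return bool(SPIDER_PATTERN.search(user_agent)) and not is_wikimedia_bot(user_agent)

    print(is_spider("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
    print(is_wikimedia_bot("ExampleReader/0.1 WikimediaBot"))            # True
    print(is_spider("ExampleReader/0.1 WikimediaBot"))                   # False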
Marcel Ruiz Forns, 28/01/2016 01:15:
we (Analytics team) never finished establishing and advertising it. In this email we explain what the convention is today and what purpose it serves.
So this email is not meant to advertise the convention, right? Because the audience of this mailing list certainly doesn't include crawler operators.
[...] (*) There is already another convention[2] for bots that EDIT Wikimedia content.
[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-... [2] https://www.mediawiki.org/wiki/Manual:Bots
I suppose this page was linked because from there one can click [[API:Client code]] > https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header > https://meta.wikimedia.org/wiki/User-Agent_policy which is the mentioned convention.
Nemo
I don't see anything described in [[Manual:Bots]] that would actually help you detect bot editing traffic.
&
Yeah, I don't see anything at [[Manual:Bots]] that mentions a user-agent convention for bots that edit Wikipedia.
&
I suppose this page was linked because from there one can click [[API:Client code]] > https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header > https://meta.wikimedia.org/wiki/User-Agent_policy which is the mentioned convention.
Sorry, I pasted a wrong link, the correct one is: https://en.wikipedia.org/wiki/Wikipedia:Bot_policy
---
I wonder if Marcel means "crawlers".
Toby, do you mean where I refer to spiders? Yes, I think they are equivalent terms. Do you think we should change the naming there?
---
Why not just "Bot", or "MediaWikiBot" which at least encompasses all sites
that the client can communicate with.
I personally agree with you, "MediaWikiBot" seems to have better semantics.
---
So this email is not meant to advertise the convention, right? Because the audience of this mailing list certainly doesn't include crawler operators.
No, it's not. It is to reach consensus on the convention and identify things that we can do to improve its application. Thanks for pointing that out; it was unclear in the initial email.
On Thu, Jan 28, 2016 at 4:27 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
I wonder if Marcel means "crawlers".
Toby, do you mean where I refer to spiders? Yes, I think they are equivalent terms. Do you think we should change the naming there?
Hi Marcel --
Here's some documentation from Google where they use bot, spider and crawler to refer to the same thing.
Googlebot is Google's web crawling bot (sometimes also called a "spider"). Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index.
Clearly Wikipedia et al. use "bot" to refer to automated software that edits the site, but it seems like you are using the term "bot" to refer to all automated software, and it might be good to clarify.
-Toby
On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" mforns@wikimedia.org wrote:
Why not just "Bot", or "MediaWikiBot" which at least encompasses all
sites that the client
can communicate with.
I personally agree with you, "MediaWikiBot" seems to have better
semantics.
For clients accessing the MediaWiki api, it is redundant. All it does is identify bots that comply with this edict from analytics.
-- John Vandenberg
Clearly Wikipedia et al. use "bot" to refer to automated software that edits the site, but it seems like you are using the term "bot" to refer to all automated software, and it might be good to clarify.
OK, in the documentation we can make that clear. And looking into that, I've seen that some bots, in the process of doing their "editing" work, can also generate pageviews. So we should also include them as a potential source of pageview traffic. Maybe we can reuse the existing User-Agent policy.
This makes a lot of sense. If I build a bot that crawls Wikipedia and Facebook public pages, it really doesn't make sense for my bot to have a "WikimediaBot" user agent; just the word "Bot" should probably be enough.
Totally agree.
I guess a bigger question is why try to differentiate between "spiders" and "bots" at all?
I don't think we need to differentiate between "spiders" and "bots". The most important question we want to answer is: how much of the traffic we consider "human" today is actually "bot"? So, +1 "bot" (case-insensitive).
So, trying to bring everyone's points of view together, what about the following?
1. Use the existing https://meta.wikimedia.org/wiki/User-Agent_policy and modify it to encourage adding the word "bot" (case-insensitive) to the User-Agent string, so that it can be easily used to identify bots in the analytics cluster (no regexps). And link that page from whatever other pages we think necessary.
2. Do some advertising and outreach and get some bot maintainers and maybe some frameworks to implement the User-Agent policy. This would make the existing policy less useless.
Thanks all for the feedback!
Hi Marcel,
It will take time for frameworks to implement an amended User-Agent policy. For example, pywikipedia (pywikibot compat) is not actively maintained. We don't know how much traffic is generated by compat. There was a task filed against Analytics for this, but Dan Andreescu removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
There are a lot of clients that need to be upgraded or be decommissioned for this 'add bot' strategy to be effective in the near future. see https://www.mediawiki.org/wiki/API:Client_code
The all-important missing step is:
3. Create a plan to block clients that don't implement the (amended) User-Agent policy.
Without that plan, successfully implemented, you will not get quality data (i.e. using 'Netscape' in the U-A to guess 'human' would perform better).
Thanks, John, for your explanation. I can totally see your point.
In the past, the Analytics team also considered enforcing the convention by blocking those bots that don't follow it. And that is still an option to consider.
One question to all, though: Are we considering bots that scrape Wikimedia sites directly (not using the API)? And are we considering humans that manually use the API? It's very difficult to identify those, and we could easily end up with false positives and blocking legitimate requests, no?
Another option to this thread would be: cancelling the convention and continuing to work on regexps and other analyses (like the one done for last-access devices).
More thoughts?
It will take time for frameworks to implement an amended User-Agent policy. For example, pywikipedia (pywikibot compat) is not actively maintained.
That doesn't imply we shouldn't have a policy that anyone can refer to; these bots will not follow it until they get some maintainers.
There was a task filed against Analytics for this, but Dan Andreescu removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
Sorry that the tagging is confusing. I think the Analytics tag was removed because this is a request for data and our team doesn't do data retrieval. We normally tag with "analytics" the Phabricator items that have actionables for our team. I am cc-ing Bryan, who has already done some analysis on bot requests to the API and can probably provide some data.
In the past, the Analytics team also considered enforcing the convention by blocking those bots that don't follow it. And that is still an option to consider.
I would like to point out that I think this is probably the prerogative of the API team rather than Analytics.
Another option to this thread would be: cancelling the convention and continuing to work on regexps
I think regardless of our convention we will always be doing regex detection of self-identified bots. Maybe I am missing some nuance here?
Another option to this thread would be: cancelling the convention and continuing to work on regexps
I think regardless of our convention we will always be doing regex detection of self-identified bots. Maybe I am missing some nuance here?
No, no, Nuria, you're right. I meant continue to improve the regexps and the other means we have to identify bots. I didn't imply that we should stop doing regexps if we establish the convention.
On Mon, Feb 1, 2016 at 11:42 AM, Nuria Ruiz nuria@wikimedia.org wrote:
It would be possible to make some relative comparisons of pywikibot versions using the data that is currently collected in the wmf.webrequest data set. "Someday" I'll get T108618 [0] finished, which will make answering some of the more granular questions in T99373 [1] easier. Kunal talked with Brad and me a few weeks ago when we were all in SF for the DevSummit about other instrumentation that could be put in place specifically for pywikibot, so that something like Special:ApiFeatureUsage [2] could be created for pywikibot version tracking as well. This all seems like a fork of the topic at hand, however.
[0]: https://phabricator.wikimedia.org/T108618 [1]: https://phabricator.wikimedia.org/T99373 [2]: https://en.wikipedia.org/wiki/Special:ApiFeatureUsage
Bryan
Hi all,
It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
3. Create a plan to block clients that don't implement the (amended) User-Agent policy.
I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we get from them. If we don't get a satisfactory reaction from bot/framework maintainers, we can then go for 3). John, would you be OK with that?
If no-one else raises concerns about this, the Analytics team will:
1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
2. Advertise the convention and reach out to bot/framework maintainers to increase the share of bots that implement the User-Agent policy.
Thanks!
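(To make point 1 concrete, a hypothetical example; every name, version and address below is invented, and the exact format recommendations belong to the User-Agent policy page itself.)

    # A user-agent in the spirit of the existing policy (client name/version,
    # contact info, library), with "bot" present so that a simple
    # case-insensitive check identifies it:
    USER_AGENT = (
        "CountVandalismBot/1.2 "
        "(https://example.org/count-bot; count-bot@example.org) "
        "python-requests/2.9"
    )
    assert "bot" in USER_AGENT.lower()  # the case-insensitive check proposed above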
On Tue, Feb 2, 2016 at 12:40 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
The proposed plan sounds good to me.
I'm very strongly against the suggestion of blocking anyone's access to api.php or the wikis in general over not having "bot" in the user-agent string however. Getting cleaner analytics is a nice goal but the point of the projects is to collect and disseminate information. You might get blocked for doing something deliberately harmful to the services or the community, but getting blocked for not following an arbitrary convention that causes no real harm is extreme. You will quickly find yourself in a strange conundrum as well. To block, you will need to establish the intent behind the User-Agent, and if you can do that then you probably don't need the "bot" tagging convention in the first place.
Bryan
I'm very strongly against the suggestion of blocking anyone's access to api.php or the wikis in general over not having "bot" in the user-agent string however. Getting cleaner analytics is a nice goal but the point of the projects is to collect and disseminate information.
Much agreed.
On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
One of the strange areas to consider is jQuery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks. https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163 https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administrated by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
Poorly written/single-purpose/once-off clients are less of a problem, as forcing change on them has lower impact.
[[w:User_agent]] says:
"Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said
'I don't think we need to differentiate between "spiders" and "bots".'
If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot?
But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured.
So you will still need a long regex for user-agents of tools which you can't impose this change onto.
If you do not want Googlebot to be grouped together with API-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
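(A small Python sketch of the tension described above, also folding in the earlier URL/email suggestion; all pattern lists here are invented examples, not anything actually deployed.)

    import re

    # "bot" alone over-matches (Googlebot) and under-matches (Yahoo! Slurp),
    # so extra include/exclude lists are still needed. Invented examples only.
    BOT_WORD = re.compile(r"bot", re.IGNORECASE)
    KNOWN_CRAWLERS = re.compile(r"Yahoo! Slurp|crawler|spider", re.IGNORECASE)
    SEARCH_ENGINES = re.compile(r"Googlebot|YandexBot|bingbot", re.IGNORECASE)
    CONTACT_INFO = re.compile(r"https?://|@|mailto:", re.IGNORECASE)  # URL/email heuristic

    def classify(ua):
        if SEARCH_ENGINES.search(ua):
            return "search-engine crawler"
        if BOT_WORD.search(ua) or KNOWN_CRAWLERS.search(ua) or CONTACT_INFO.search(ua):
            return "bot"
        return "maybe human"

    print(classify("Googlebot/2.1 (+http://www.google.com/bot.html)"))        # search-engine crawler
    print(classify("Mozilla/5.0 (compatible; Yahoo! Slurp; help.yahoo.com)"))  # bot
    print(classify("ExampleTool/1.0 (contact@example.org)"))                   # bot
    print(classify("Mozilla/5.0 (Windows NT 10.0; rv:44.0) Firefox/44.0"))     # maybe human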
If no-one else raises concerns about this, the Analytics team will:
Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
If you are only updating the policy to "encourage" the use of the 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-)
The dispute will occur when the addition of 'bot' becomes mandatory.
-- John Vandenberg
John, thanks a lot for taking the time to answer my question. My responses are inline (I rearranged some of your paragraphs to respond to them together):
I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
&
If you do not want Googlebot to be grouped together with API-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
&
The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said 'I don't think we need to differentiate between "spiders" and "bots".' If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot? But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured. So you will still need a long regex for user-agents of tools which you can't impose this change onto.
Differentiating between "spiders" and "bots" can be very tricky, as you explain. There was some work on it in the past, but what we really want at the moment is to split human vs. bot traffic with higher accuracy. I will add that to the docs, thanks. Regarding measuring the impact: as we'll not be able to differentiate "spiders" and "bots", we can only observe the variations of the human vs. bot traffic rates over time and try to associate them with recent changes in User-Agent strings or regular expressions.
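As a very rough sketch of that kind of monitoring (the daily counts below are made-up placeholders for what the cluster would actually aggregate):

    # Hypothetical daily totals of requests classified as bot vs. human.
    daily_counts = {
        "2016-02-01": {"bot": 420000, "human": 1800000},
        "2016-02-02": {"bot": 455000, "human": 1790000},
        "2016-02-03": {"bot": 610000, "human": 1640000},  # e.g. after a UA/regex change
    }

    for day in sorted(daily_counts):
        counts = daily_counts[day]
        bot_share = counts["bot"] / (counts["bot"] + counts["human"])
        print(day, "bot share: {:.1%}".format(bot_share))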
The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
Agree, will add that to the proposal.
One of the strange areas to consider is jQuery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
I think the key here is: a program should be tagged as a bot by Analytics if it generates pageviews not consumed onsite by a human. I will mention that in the docs, too. Thanks.
If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
&
Please understand the gravity of what you are imposing. Changing a client's user-agent is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make things easier for the Wikimedia Analytics scripts. Strictly speaking, your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
&
If you are only updating the policy to "encourage" the use of the 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-) The dispute will occur when the addition of 'bot' becomes mandatory.
I see your point. The addition of "bot" will be optional (as is the rest of the policy); we will make that clear in the docs.
If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay! But do we need to add 'human' to the user-agent for some scripts?), but many client frameworks will still need to change their user-agent, including, for example, both of the Go frameworks: https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
&
By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
&
[[w:User_agent]] says: "Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot." So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
That is a very good insight, thanks. Currently, the User-Agent policy is not implemented in our regular expressions, meaning they do not match emails, user pages, or other MediaWiki URLs. They could also, as you suggest, match GitHub accounts or tools.wmflabs.org. We in Analytics should tackle that. I will create a task for it and add it to the proposal.
Thanks again; I'll shortly send the proposal with the changes.
On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg jayvdb@gmail.com wrote:
On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Hi all,
It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
- Create a plan to block clients that don't implement the (amended) User-Agent policy.
I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we raise from them. If we don't get a satisfactory reaction from bot/framework maintainers, we then can go for 3). John, would you be OK with that?
Hi again analytics list,
Thank you all for your comments and feedback! We consider this thread closed and will now proceed to:
1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage (optionally) including the word "bot" (case-insensitive) in the User-Agent string, so that bots that generate pageviews not consumed onsite by humans can be easily identified by the Analytics cluster, thus increasing the accuracy of the human-vs-bot traffic split.
2. Advertise the convention and reach out to bot/framework maintainers to increase the share of bots that implement the User-Agent policy.
3. The Analytics team should implement regular expressions that match the current User-Agent policy, i.e. User-Agent strings containing emails, user pages, other MediaWiki URLs, GitHub URLs, and tools.wmflabs.org URLs (see the sketch below). This will take some time and will probably raise technical issues, but it seems we can benefit from it. https://phabricator.wikimedia.org/T125731
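As a sketch of what the regular expressions in step 3 might look like (a simplified Python stand-in with hypothetical patterns; the real implementation lives in analytics-refinery-source and will be worked out in the Phabricator task):

    import re

    # Simplified, hypothetical stand-ins for the patterns described in step 3.
    POLICY_PATTERNS = [
        r"bot",                                    # the optional "bot" keyword (step 1)
        r"[\w.+-]+@[\w-]+\.[\w.-]+",               # contact email
        r"https?://\S*\bUser:\S+",                 # wiki user page URL
        r"https?://(www\.)?mediawiki\.org/\S+",    # other MediaWiki URLs
        r"https?://github\.com/\S+",               # GitHub URLs
        r"tools\.wmflabs\.org",                    # Tool Labs hosted tools
    ]
    POLICY_RE = re.compile("|".join(POLICY_PATTERNS), re.IGNORECASE)

    def follows_user_agent_policy(user_agent):
        return bool(POLICY_RE.search(user_agent))

    print(follows_user_agent_policy(
        "ExampleTool/1.0 (https://github.com/example/tool; maintainer@example.org)"))  # True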
Cheers!
-- *Marcel Ruiz Forns* Analytics Developer Wikimedia Foundation