Hi analytics list,
In the past months the WikimediaBot convention has been mentioned in a couple of threads, but we (the Analytics team) never finished establishing and advertising it. In this email we explain what the convention is today and what purpose it serves, and we also ask for feedback to make sure we can continue with the next steps.
What is the WikimediaBot convention? It is a way of better identifying Wikimedia traffic originating from bots. Today we know that a significant share of Wikimedia traffic comes from bots. We can recognize part of that traffic with regular expressions[1], but we cannot recognize all of it, because some bots do not identify themselves as such. If we could identify a greater part of the bot traffic, we could also better isolate the human traffic and permit more accurate analyses.
Who should follow the convention? Computer programs that access Wikimedia sites or the Wikimedia API for reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention? Computer programs that follow the on-site, ad-hoc commands of a human, like browsers; and well-known spiders that are already recognizable by their user-agent strings.
How to follow the convention? The client's user-agent string should contain the word "WikimediaBot". The word can be anywhere within the user-agent string and is case-sensitive.
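As an illustration only (not part of the proposal): a minimal Python sketch of a scheduled read-only client that would follow the convention. The script name, contact URL and use of the requests library are invented example choices.

    import requests

    # The only requirement of the convention: the word "WikimediaBot"
    # (case-sensitive) appears somewhere in the user-agent string. The
    # client name and contact URL below are made-up examples in the spirit
    # of the general User-Agent policy.
    HEADERS = {
        "User-Agent": "ExampleReader/0.1 (https://example.org/contact) WikimediaBot"
    }

    def fetch_page_info(title):
        # Read-only API request, e.g. run periodically from a cron job.
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "titles": title, "format": "json"},
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(fetch_page_info("Wikipedia"))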
So, please, feel free to post your comments/feedback on this thread. In the course of this discussion we can adjust the convention's definition and, if no major concerns are raised, in 2 weeks we'll create a documentation page on Wikitech, send an email to the proper mailing lists and maybe write a blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia content.
[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-... [2] https://www.mediawiki.org/wiki/Manual:Bots
On Wed, Jan 27, 2016 at 5:15 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
The number one User-Agent for hits to api.php is "-", despite [[:meta:User-Agent_policy]] [3]. I think we can do a bit of outreach and get some bot maintainers and maybe a framework or two to implement this. I would be very surprised if it went much further than that unless we actually implement some sort of enforcement policy. That being said, I completely understand that maintaining a regex or other matching system isn't scalable at all, and this convention seems like a reasonable way to start chipping away at the problem.
[3]: https://meta.wikimedia.org/wiki/User-Agent_policy
On Wed, Jan 27, 2016 at 5:15 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
(*) There is already another convention[2] for bots that EDIT Wikimedia content. [2] https://www.mediawiki.org/wiki/Manual:Bots
I don't see anything described in [[Manual:Bots]] that would actually help you detect bot editing traffic. What am I missing? Bots (or any high-volume editing user) are typically asked to get specific rights on the wikis that they are interacting with, but those on-wiki permissions are associated with the wiki account and not the user agent.
Bryan
I wonder if Marcel means "crawlers".
Yeah, I don't see anything at [[Manual:Bots]] that mentions a user-agent convention for bots that edit Wikipedia. As far as I know, there isn't any. Most editing bots either use a user-agent string set by the framework they are using or have a completely unique user-agent string. It seems like it would also be a good idea to create a convention for bots that edit.
On Thu, Jan 28, 2016 at 11:15 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
This is useless unless someone is going to start blocking bots that don't follow it.
There is an existing policy, which is not being followed / enforced.
https://meta.wikimedia.org/wiki/User-Agent_policy
It is also extremely annoying that clients (e.g. Pywikibot) now need to add a Wikimedia-specific tag to their user-agent. A user-agent should be client specific, not server specific. Why not just "Bot", or "MediaWikiBot", which at least encompasses all sites that the client can communicate with?
A user-agent should be client specific, not server specific.
This makes a lot of sense. If I build a bot that crawls Wikipedia and Facebook public pages, it really doesn't make sense for my bot to have a "WikimediaBot" user agent; just the word "Bot" should probably be enough.
On Thu, Jan 28, 2016 at 11:47 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Anything with "bot" (case-insensitive) in the UA is already caught by the "spiderPattern" regex [0]. The rest of the new logic in Webrequest related to this feature seems to take that into account by making isSpider() check that wikimediaBotPattern is not matched. I guess a bigger question is why try to differentiate between "spiders" and "bots" at all?
[0]: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-...
Bryan
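(To illustrate the check Bryan describes: a rough Python approximation of that logic. The actual implementation is the Java code linked as [0]; the pattern contents below are simplified stand-ins, not the real refinery regexes.)

    import re

    # Simplified stand-ins for the patterns in refinery-source [0].
    SPIDER_PATTERN = re.compile(r"bot|spider|crawler|http", re.IGNORECASE)
    WIKIMEDIA_BOT_PATTERN = re.compile(r"WikimediaBot")  # case-sensitive, per the convention

    def is_wikimedia_bot(user_agent):
        return bool(WIKIMEDIA_BOT_PATTERN.search(user_agent))

    def is_spider(user_agent):
        # As described above: counts as "spider" when the generic pattern
        # matches but the UA is not a self-identified WikimediaBot.
        return bool(SPIDER_PATTERN.search(user_agent)) and not is_wikimedia_bot(user_agent)

    print(is_spider("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
    print(is_wikimedia_bot("ExampleReader/0.1 WikimediaBot"))            # True
    print(is_spider("ExampleReader/0.1 WikimediaBot"))                   # False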
Marcel Ruiz Forns, 28/01/2016 01:15:
we (Analytics team) never finished establishing and advertising it. In this email we explain what the convention is today and what purpose it serves.
So this email is not meant to advertise the convention, right? Because the audience of this mailing list certainly doesn't include crawler operators.
[...] (*) There is already another convention[2] for bots that EDIT Wikimedia content.
[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-... [2] https://www.mediawiki.org/wiki/Manual:Bots
I suppose this page was linked because from there one can click [[API:Client code]] > https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header > https://meta.wikimedia.org/wiki/User-Agent_policy which is the mentioned convention.
Nemo
I don't see anything described in [[Manual:Bots]] that would actually help you detect bot editing traffic.
&
Yeah, I don't see anything at [[Manual:Bots]] that mentions a user-agent convention for bots that edit Wikipedia.
&
I suppose this page was linked because from there one can click [[API:Client code]] > https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header > https://meta.wikimedia.org/wiki/User-Agent_policy which is the mentioned convention.
Sorry, I pasted a wrong link, the correct one is: https://en.wikipedia.org/wiki/Wikipedia:Bot_policy
---
I wonder if Marcel means "crawlers".
Toby, do you mean where I refer to spiders? Yes, I think they are equivalent terms. Do you think we should change the naming there?
---
Why not just "Bot", or "MediaWikiBot" which at least encompasses all sites
that the client can communicate with.
I personally agree with you, "MediaWikiBot" seems to have better semantics.
---
So this email is not meant to advertise the convention, right? Because the audience of this mailing list certainly doesn't include crawler operators.
No, it's not. It is to reach consensus on the convention and identify things that we can do to improve its application. Thanks for pointing that out; it was unclear in the initial email.
On Thu, Jan 28, 2016 at 4:27 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
I wonder if Marcel means "crawlers".
Toby, do you mean where I refer to spiders? Yes, I think they are equivalent terms. Do you think we should change the naming there?
Hi Marcel --
Here's some documentation from Google where they use bot, spider and crawler to refer to the same thing.
Googlebot is Google's web crawling bot (sometimes also called a "spider"). Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index.
Clearly Wikipedia et al. use "bot" to refer to automated software that edits the site, but it seems like you are using the term "bot" to refer to all automated software, and it might be good to clarify.
-Toby
On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" mforns@wikimedia.org wrote:
Why not just "Bot", or "MediaWikiBot" which at least encompasses all
sites that the client
can communicate with.
I personally agree with you, "MediaWikiBot" seems to have better
semantics.
For clients accessing the MediaWiki api, it is redundant. All it does is identify bots that comply with this edict from analytics.
-- John Vandenberg
Clearly Wikipedia et al. use "bot" to refer to automated software that edits the site, but it seems like you are using the term "bot" to refer to all automated software, and it might be good to clarify.
OK, in the documentation we can make that clear. And looking into that, I've seen that some bots, in the process of doing their "editing" work, can also generate pageviews. So we should also include them as a potential source of pageview traffic. Maybe we can reuse the existing User-Agent policy.
This makes a lot of sense. If I build a bot that crawls Wikipedia and Facebook public pages, it really doesn't make sense for my bot to have a "WikimediaBot" user agent; just the word "Bot" should probably be enough.
Totally agree.
I guess a bigger question is why try to differentiate between "spiders" and "bots" at all?
I don't think we need to differentiate between "spiders" and "bots". The most important question we want to answer is: how much of the traffic we consider "human" today is actually "bot"? So, +1 "bot" (case-insensitive).
So, trying to bring everyone's points of view together, what about the following?
1. Use the existing https://meta.wikimedia.org/wiki/User-Agent_policy and modify it to encourage adding the word "bot" (case-insensitive) to the User-Agent string, so that it can be easily used to identify bots in the analytics cluster (no regexps). And link that page from whatever other pages we think necessary.
2. Do some advertising and outreach and get some bot maintainers and maybe some frameworks to implement the User-Agent policy. This would make the existing policy less useless.
Thanks all for the feedback!
Hi Marcel,
It will take time for frameworks to implement an amended User-Agent policy. For example, pywikipedia (pywikibot compat) is not actively maintained. We don't know how much traffic is generated by compat. There was a task filed against Analytics for this, but Dan Andreescu removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
There are a lot of clients that need to be upgraded or be decommissioned for this 'add bot' strategy to be effective in the near future. see https://www.mediawiki.org/wiki/API:Client_code
The all-important missing step is:
3. Create a plan to block clients that don't implement the (amended) User-Agent policy.
Without that plan, successfully implemented, you will not get quality data (i.e. using 'Netscape' in the U-A to guess 'human' would perform better).
Thanks, John, for your explanation. I can totally see your point.
In the past, the Analytics team also considered enforcing the convention by blocking those bots that don't follow it. And that is still an option to consider.
One question to all, though: Are we considering bots that scrape Wikimedia sites directly (not using the API)? And are we considering humans that manually use the API? It's very difficult to identify those, and we could easily end up with false positives and blocking legitimate requests, no?
Another option to this thread would be: cancelling the convention and continuing to work on regexps and other analyses (like the one done for last-access devices).
More thoughts?
It will take time for frameworks to implement an amended User-Agent policy. For example, pywikipedia (pywikibot compat) is not actively maintained.
That doesn't imply we shouldn't have a policy that anyone can refer to; these bots will not follow it until they get some maintainers.
There was a task filed against Analytics for this, but Dan Andreescu removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
Sorry that the tagging is confusing. I think the Analytics tag was removed because this is a request for data and our team doesn't do data retrieval. We normally tag with "analytics" the Phabricator items that have actionables for our team. I am cc-ing Bryan, who has already done some analysis on bot requests to the API and can probably provide some data.
In the past, the Analytics team also considered enforcing the convention by blocking those bots that don't follow it. And that is still an option to consider.
I would like to point out that I think this is probably the prerogative of the API team rather than Analytics.
Another option to this thread would be: cancelling the convention and continuing to work on regexps
I think regardless of our convention we will always be doing regex detection of self-identified bots. Maybe I am missing some nuance here?
Another option to this thread would be: cancelling the convention and continuing to work on regexps
I think regardless of our convention we will always be doing regex detection of self-identified bots. Maybe I am missing some nuance here?
No, no, Nuria, you're right. I meant continue to improve the regexps and the other means we have to identify bots. I didn't imply that we should stop doing regexps if we establish the convention.
On Mon, Feb 1, 2016 at 11:42 AM, Nuria Ruiz nuria@wikimedia.org wrote:
It would be possible to make some relative comparisons of pywikibot versions using the data that is currently collected in the wmf.webrequest data set. "Someday" I'll get T108618 [0] finished, which will make answering some of the more granular questions in T99373 [1] easier. Kunal talked with Brad and me a few weeks ago when we were all in SF for the DevSummit about other instrumentation that could be put in place specifically for pywikibot, so that something like Special:ApiFeatureUsage [2] could be created for pywikibot version tracking as well. This all seems like a fork of the topic at hand, however.
[0]: https://phabricator.wikimedia.org/T108618 [1]: https://phabricator.wikimedia.org/T99373 [2]: https://en.wikipedia.org/wiki/Special:ApiFeatureUsage
Bryan
Hi all,
It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
3. Create a plan to block clients that don't implement the (amended) User-Agent policy.
I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we get from them. If we don't get a satisfactory reaction from bot/framework maintainers, we can then go for 3). John, would you be OK with that?
If no-one else raises concerns about this, the Analytics team will:
1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
2. Advertise the convention and reach out to bot/framework maintainers to increase the share of bots that implement the User-Agent policy.
Thanks!
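(To make point 1 concrete, a hypothetical example; every name, version and address below is invented, and the exact format recommendations belong to the User-Agent policy page itself.)

    # A user-agent in the spirit of the existing policy (client name/version,
    # contact info, library), with "bot" present so that a simple
    # case-insensitive check identifies it:
    USER_AGENT = (
        "CountVandalismBot/1.2 "
        "(https://example.org/count-bot; count-bot@example.org) "
        "python-requests/2.9"
    )
    assert "bot" in USER_AGENT.lower()  # the case-insensitive check proposed above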
On Tue, Feb 2, 2016 at 12:40 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
The proposed plan sounds good to me.
I'm very strongly against the suggestion of blocking anyone's access to api.php or the wikis in general over not having "bot" in the user-agent string however. Getting cleaner analytics is a nice goal but the point of the projects is to collect and disseminate information. You might get blocked for doing something deliberately harmful to the services or the community, but getting blocked for not following an arbitrary convention that causes no real harm is extreme. You will quickly find yourself in a strange conundrum as well. To block, you will need to establish the intent behind the User-Agent, and if you can do that then you probably don't need the "bot" tagging convention in the first place.
Bryan
I'm very strongly against the suggestion of blocking anyone's access to api.php or the wikis in general over not having "bot" in the user-agent string however. Getting cleaner analytics is a nice goal but the point of the projects is to collect and disseminate information.
Much agreed.
On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
One of the strange areas to consider is jQuery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks. https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163 https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administrated by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
Poorly written/single-purpose/once-off clients are less of a problem, as forcing change on them has lower impact.
[[w:User_agent]] says:
"Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said
'I don't think we need to differentiate between "spiders" and "bots".'
If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot?
But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured.
So you will still need a long regex for user-agents of tools which you can't impose this change onto.
If you do not want Googlebot to be grouped together with API-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
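(A small Python sketch of the tension described above, also folding in the earlier URL/email suggestion; all pattern lists here are invented examples, not anything actually deployed.)

    import re

    # "bot" alone over-matches (Googlebot) and under-matches (Yahoo! Slurp),
    # so extra include/exclude lists are still needed. Invented examples only.
    BOT_WORD = re.compile(r"bot", re.IGNORECASE)
    KNOWN_CRAWLERS = re.compile(r"Yahoo! Slurp|crawler|spider", re.IGNORECASE)
    SEARCH_ENGINES = re.compile(r"Googlebot|YandexBot|bingbot", re.IGNORECASE)
    CONTACT_INFO = re.compile(r"https?://|@|mailto:", re.IGNORECASE)  # URL/email heuristic

    def classify(ua):
        if SEARCH_ENGINES.search(ua):
            return "search-engine crawler"
        if BOT_WORD.search(ua) or KNOWN_CRAWLERS.search(ua) or CONTACT_INFO.search(ua):
            return "bot"
        return "maybe human"

    print(classify("Googlebot/2.1 (+http://www.google.com/bot.html)"))        # search-engine crawler
    print(classify("Mozilla/5.0 (compatible; Yahoo! Slurp; help.yahoo.com)"))  # bot
    print(classify("ExampleTool/1.0 (contact@example.org)"))                   # bot
    print(classify("Mozilla/5.0 (Windows NT 10.0; rv:44.0) Firefox/44.0"))     # maybe human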
If no-one else raises concerns about this, the Analytics team will:
Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
If you are only updating the policy to "encourage" the use of the 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-)
The dispute will occur when the addition of 'bot' becomes mandatory.
-- John Vandenberg
John, thanks a lot for taking the time to answer my question. My responses are inline (I rearranged some of your paragraphs to respond to them together):
I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
&
If you do not want Googlebot to be grouped together with API-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
&
The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said 'I don't think we need to differentiate between "spiders" and "bots".' If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot? But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured. So you will still need a long regex for user-agents of tools which you can't impose this change onto.
Differentiating between "spiders" and "bots" can be very tricky, as you explain. There was some work on it in the past, but what we really want at the moment is to split human vs. bot traffic with higher accuracy. I will add that to the docs, thanks. Regarding measuring the impact: as we'll not be able to differentiate "spiders" and "bots", we can only observe the variations of the human vs. bot traffic rates over time and try to associate them with recent changes in User-Agent strings or regular expressions.
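As a very rough sketch of that kind of monitoring (the daily counts below are made-up placeholders for what the cluster would actually aggregate):

    # Hypothetical daily totals of requests classified as bot vs. human.
    daily_counts = {
        "2016-02-01": {"bot": 420000, "human": 1800000},
        "2016-02-02": {"bot": 455000, "human": 1790000},
        "2016-02-03": {"bot": 610000, "human": 1640000},  # e.g. after a UA/regex change
    }

    for day in sorted(daily_counts):
        counts = daily_counts[day]
        bot_share = counts["bot"] / (counts["bot"] + counts["human"])
        print(day, "bot share: {:.1%}".format(bot_share))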
The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
Agree, will add that to the proposal.
One of the strange areas to consider is jQuery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
I think the key here is: a program should be tagged as a bot by Analytics if it generates pageviews not consumed onsite by a human. I will mention that in the docs, too. Thanks.
If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
&
Please understand the gravity of what you are imposing. Changing a client's user-agent is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make things easier for the Wikimedia Analytics scripts. Strictly speaking, your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
&
If you are only updating the policy to "encourage" the use of the 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-) The dispute will occur when the addition of 'bot' becomes mandatory.
I see your point. The addition of "bot" will be optional (as is the rest of the policy); we will make that clear in the docs.
If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay! But do we need to add 'human' to the user-agent for some scripts?), but many client frameworks will still need to change their user-agent, including, for example, both of the Go frameworks: https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad...
&
By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
&
[[w:User_agent]] says: "Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot." So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
That is a very good insight, thanks. Currently, the User-Agent policy is not implemented in our regular expressions, meaning they do not match emails, user pages, or other MediaWiki URLs. They could also, as you suggest, match GitHub accounts or tools.wmflabs.org. We in Analytics should tackle that. I will create a task for it and add it to the proposal.
Thanks again; I'll shortly send the proposal with the changes.
On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg jayvdb@gmail.com wrote:
On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Hi all,
It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
- Create a plan to block clients that don't implement the (amended) User-Agent policy.
I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we raise from them. If we don't get a satisfactory reaction from bot/framework maintainers, we then can go for 3). John, would you be OK with that?
Hi again analytics list,
Thank you all for your comments and feedback! We consider this thread closed and will now proceed to:
1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage (optionally) including the word "bot" (case-insensitive) in the User-Agent string, so that bots that generate pageviews not consumed onsite by humans can be easily identified by the Analytics cluster, thus increasing the accuracy of the human-vs-bot traffic split.
2. Advertise the convention and reach out to bot/framework maintainers to increase the share of bots that implement the User-Agent policy.
3. The Analytics team should implement regular expressions that match the current User-Agent policy, i.e. User-Agent strings containing emails, user pages, other MediaWiki URLs, GitHub URLs, and tools.wmflabs.org URLs (see the sketch below). This will take some time and will probably raise technical issues, but it seems we can benefit from it. https://phabricator.wikimedia.org/T125731
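As a sketch of what the regular expressions in step 3 might look like (a simplified Python stand-in with hypothetical patterns; the real implementation lives in analytics-refinery-source and will be worked out in the Phabricator task):

    import re

    # Simplified, hypothetical stand-ins for the patterns described in step 3.
    POLICY_PATTERNS = [
        r"bot",                                    # the optional "bot" keyword (step 1)
        r"[\w.+-]+@[\w-]+\.[\w.-]+",               # contact email
        r"https?://\S*\bUser:\S+",                 # wiki user page URL
        r"https?://(www\.)?mediawiki\.org/\S+",    # other MediaWiki URLs
        r"https?://github\.com/\S+",               # GitHub URLs
        r"tools\.wmflabs\.org",                    # Tool Labs hosted tools
    ]
    POLICY_RE = re.compile("|".join(POLICY_PATTERNS), re.IGNORECASE)

    def follows_user_agent_policy(user_agent):
        return bool(POLICY_RE.search(user_agent))

    print(follows_user_agent_policy(
        "ExampleTool/1.0 (https://github.com/example/tool; maintainer@example.org)"))  # True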
Cheers!
-- *Marcel Ruiz Forns* Analytics Developer Wikimedia Foundation