On Mon, Feb 1, 2016 at 7:44 PM, Nuria Ruiz <nuria@wikimedia.org> wrote:

>In the past, the Analytics team also considered enforcing the convention by blocking those bots that don't follow it. And that is still an option to consider.
I would like to point out that I think this is probably the prerogative of the API team rather than Analytics.

>Another option to this thread would be: cancelling the convention and continuing to work on regexps
I think regardless of our convention we will always be doing regex detection of self-identified bots. Maybe I am missing some nuance here?
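
For illustration only, here is a minimal sketch of the two checks being discussed: the case-insensitive "bot" convention proposed in this thread, and regex detection of self-identified bots. The function names and the regex pattern are hypothetical and are not the ones used in the analytics cluster.

```python
import re

# Hypothetical pattern, for illustration only; NOT the regex used in production.
SELF_IDENTIFIED_BOT = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def follows_bot_convention(user_agent: str) -> bool:
    """Return True if the User-Agent contains "bot" in any letter case."""
    return "bot" in user_agent.lower()


def looks_like_self_identified_bot(user_agent: str) -> bool:
    """Return True if the User-Agent matches the illustrative regex above."""
    return SELF_IDENTIFIED_BOT.search(user_agent) is not None
```

A User-Agent such as "ExampleSpider/2.0" would be caught by the regex but not by the convention check, which is why regex detection would probably still be needed alongside the convention. Note also that a plain substring check matches any User-Agent that happens to contain the letters "bot".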

On Mon, Feb 1, 2016 at 10:42 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
>It will take time for frameworks to implement an amended User-Agent policy.
>For example, pywikipedia (pywikibot compat) is not actively
>maintained.
That doesn't imply we shouldn't have a policy that anyone can refer to; these bots simply will not follow it until they get some maintainers.

>There was a task filed against Analytics for this, but Dan Andreescu
>removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).

Sorry that the tagging is confusing. I think the Analytics tag was removed because this is a request for data, and our team doesn't do data retrieval. We normally tag with "analytics" only those Phabricator items that have actionables for our team.
I am cc-ing Bryan, who has already done some analysis of bot requests to the API and can probably provide some data.

On Mon, Feb 1, 2016 at 6:39 AM, John Mark Vandenberg <jayvdb@gmail.com> wrote:
Hi Marcel,

It will take time for frameworks to implement an amended User-Agent policy.
For example, pywikipedia (pywikibot compat) is not actively
maintained. We don't know how much traffic is generated by compat.
There was a task filed against Analytics for this, but Dan Andreescu
removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).

There are a lot of clients that need to be upgraded or be
decommissioned for this 'add bot' strategy to be effective in the near
future. See https://www.mediawiki.org/wiki/API:Client_code

The all-important missing step is

3. Create a plan to block clients that don't implement the (amended)
User-Agent policy.

Without that plan, successfully implemented, you will not get quality
data (i.e. using 'Netscape' in the U-A to guess 'human' would perform
better).

On Tue, Feb 2, 2016 at 1:24 AM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
> So, trying to join everyone's points of view, what about the following?
>
> Using the existing https://meta.wikimedia.org/wiki/User-Agent_policy and
> modify it to encourage adding the word "bot" (case-insensitive) to the
> User-Agent string, so that it can be easily used to identify bots in the
> analytics cluster (no regexps). And link that page from whatever other pages
> we think necessary.
>
> Do some advertising and outreach and get some bot maintainers and maybe some
> frameworks to implement the User-Agent policy. This would make the existing
> policy less useless.
>
> Thanks all for the feedback!
>
> On Mon, Feb 1, 2016 at 3:16 PM, Marcel Ruiz Forns <mforns@wikimedia.org>
> wrote:
>>>
>>> Clearly Wikipedia et al. uses bot to refer to automated software that
>>> edits the site but it seems like you are using the term bot to refer to all
>>> automated software and it might be good to clarify.
>>
>>
>> OK, in the documentation we can make that clear. And looking into that,
>> I've seen that some bots, in the process of doing their "editing" work can
>> also generate pageviews. So we should also include them as potential source
>> of pageview traffic. Maybe we can reuse the existing User-Agent policy.
>>
>>
>>> This makes a lot of sense. If I build a bot that crawls wikipedia and
>>> facebook public pages it really doesn't make sense that my bot has a
>>> "wikimediaBot" user agent, just the word "Bot" should probably be enough.
>>
>>
>> Totally agree.
>>
>>
>>> I guess a bigger question is why try to differentiate between "spiders"
>>> and "bots" at all?
>>
>>
>> I don't think we need to differentiate between "spiders" and "bots". The
>> most important question we want to answer is: how much of the traffic we
>> consider "human" today is actually "bot". So, +1 "bot" (case-insensitive).
>>
>>
>> On Fri, Jan 29, 2016 at 9:16 PM, John Mark Vandenberg <jayvdb@gmail.com>
>> wrote:
>>>
>>> On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" <mforns@wikimedia.org>
>>> wrote:
>>> >>
>>> >> Why not just "Bot", or "MediaWikiBot" which at least encompasses all
>>> >> sites that the client
>>> >> can communicate with.
>>> >
>>> > I personally agree with you, "MediaWikiBot" seems to have better
>>> > semantics.
>>>
>>> For clients accessing the MediaWiki api, it is redundant.
>>> All it does is identify bots that comply with this edict from analytics.
>>>
>>> --
>>> John Vandenberg
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>>
>> --
>> Marcel Ruiz Forns
>> Analytics Developer
>> Wikimedia Foundation

--
John Vandenberg
