In the past, the Analytics team also considered enforcing the convention by blocking those bots that don't follow it, and that is still an option to consider. I would like to point out, though, that this is probably the prerogative of the API team rather than Analytics.
Another option raised in this thread would be cancelling the convention and continuing to work on regexps. I think that, regardless of our convention, we will always be doing regex detection of self-identified bots. Maybe I am missing some nuance here?
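To make that concrete, the self-identification check boils down to something like this (a minimal sketch, not our actual pipeline code; the function and variable names are illustrative):

    import re

    # Case-insensitive match for self-identified bots in the User-Agent
    # string, per the proposed convention; with or without the policy, a
    # regex like this is still what the detection comes down to.
    SELF_ID_BOT = re.compile(r'bot', re.IGNORECASE)

    def is_self_identified_bot(user_agent):
        """Return True if the User-Agent declares itself a bot."""
        return bool(SELF_ID_BOT.search(user_agent or ''))

    # is_self_identified_bot('MyCoolBot/1.0 (https://example.org)')  -> True
    # is_self_identified_bot('Mozilla/5.0 (X11; Linux x86_64)')      -> False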
On Mon, Feb 1, 2016 at 10:42 AM, Nuria Ruiz nuria@wikimedia.org wrote:
It will take time for frameworks to implement an amended User-Agent policy.
For example, pywikipedia (pywikibot compat) is not actively maintained.
That doesn't imply we shouldn't have a policy that anyone can refer to; these bots will not follow it until they get some maintainers.
There was a task filed against Analytics for this, but Dan Andreescu removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
Sorry that the tagging is confusing. I think the Analytics tag was removed because this is a request for data, and our team doesn't do data retrieval. We normally tag Phabricator items with "analytics" when they have actionables for our team. I am cc-ing Bryan, who has already done some analysis on bot requests to the API and can probably provide some data.
On Mon, Feb 1, 2016 at 6:39 AM, John Mark Vandenberg jayvdb@gmail.com wrote:
Hi Marcel,
It will take time for frameworks to implement an amended User-Agent policy. For example, pywikipedia (pywikibot compat) is not actively maintained. We don't know how much traffic is generated by compat. There was a task filed against Analytics for this, but Dan Andreescu removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
There are a lot of clients that need to be upgraded or decommissioned for this 'add bot' strategy to be effective in the near future. See https://www.mediawiki.org/wiki/API:Client_code
The all-important missing step is:
- Create a plan to block clients that don't implement the (amended) User-Agent policy.
Without that plan successfully implemented, you will not get quality data (e.g. using 'Netscape' in the User-Agent to guess 'human' would perform better).
On Tue, Feb 2, 2016 at 1:24 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
So, trying to bring together everyone's points of view, what about the following?
- Using the existing https://meta.wikimedia.org/wiki/User-Agent_policy and modifying it to encourage adding the word "bot" (case-insensitive) to the User-Agent string, so that it can be easily used to identify bots in the analytics cluster (no regexps). We would then link that page from whatever other pages we think necessary (see the sketch of a compliant User-Agent below).
- Doing some advertising and outreach to get some bot maintainers, and maybe some frameworks, to implement the User-Agent policy. This would make the existing policy less useless.
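For illustration, a User-Agent following the existing policy format plus the proposed "bot" keyword might look like this (the tool name, version, URL, and contact address here are hypothetical):

    MyCoolBot/1.2 (https://example.org/MyCoolBot; mycoolbot@example.org) pywikibot/2.0

The case-insensitive "bot" substring in the client name is all the analytics cluster would need to key on.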
Thanks all for the feedback!
On Mon, Feb 1, 2016 at 3:16 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
Clearly Wikipedia et al. use "bot" to refer to automated software that edits the site, but it seems like you are using the term to refer to all automated software, and it might be good to clarify.
OK, in the documentation we can make that clear. And looking into that, I've seen that some bots, in the process of doing their "editing" work, can also generate pageviews. So we should also include them as a potential source of pageview traffic. Maybe we can reuse the existing User-Agent policy.
This makes a lot of sense. If I build a bot that crawls Wikipedia and Facebook public pages, it really doesn't make sense for my bot to have a "wikimediaBot" user agent; just the word "Bot" should probably be enough.
Totally agree.
I guess a bigger question is why try to differentiate between "spiders" and "bots" at all?
I don't think we need to differentiate between "spiders" and "bots". The most important question we want to answer is: how much of the traffic we consider "human" today is actually "bot"? So, +1 for "bot" (case-insensitive).
On Fri, Jan 29, 2016 at 9:16 PM, John Mark Vandenberg <jayvdb@gmail.com> wrote:
On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" mforns@wikimedia.org wrote:
> > Why not just "Bot", or "MediaWikiBot", which at least encompasses all sites that the client can communicate with.
I personally agree with you; "MediaWikiBot" seems to have better semantics.
For clients accessing the MediaWiki API, it is redundant. All it does is identify bots that comply with this edict from Analytics.
-- John Vandenberg
-- Marcel Ruiz Forns Analytics Developer Wikimedia Foundation
-- John Vandenberg
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics