On Thu, Jan 28, 2016 at 11:15 AM, Marcel Ruiz Forns
<mforns(a)wikimedia.org> wrote:
Hi analytics list,
In the past months, the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves, and we also ask for feedback to be sure we can
continue with the next steps.
What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic originating from bots.
Today we know that a significant share of Wikimedia traffic comes from bots.
We can recognize part of that traffic with regular expressions[1], but we
cannot recognize all of it, because some bots do not identify themselves as
such. If we could identify a greater part of the bot traffic, we could also
better isolate the human traffic and permit more accurate analyses.
Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention?
Computer programs that follow the on-site, ad-hoc commands of a human, like
browsers, and well-known spiders that are otherwise recognizable by their
well-known user-agent strings.
How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can be anywhere within the user-agent string and is case-sensitive.
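For illustration, a minimal sketch of what following the convention might
look like from a Python client using only the standard library (the client
name, version, and contact URL below are hypothetical examples, not part of
the convention):

```python
import urllib.request

# A descriptive user-agent that includes the case-sensitive token
# "WikimediaBot". The client name, version, and contact URL are
# hypothetical; the convention only requires the token itself.
USER_AGENT = "ExampleWikiTool/1.0 (https://example.org/wikitool) WikimediaBot"

# The token may appear anywhere in the string, so appending it to an
# existing client identifier works. Attach it to an API request:
req = urllib.request.Request(
    "https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json",
    headers={"User-Agent": USER_AGENT},
)
```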
This is useless unless someone is going to start blocking bots that
don't follow it.
There is an existing policy, which is not being followed / enforced.
https://meta.wikimedia.org/wiki/User-Agent_policy
It is also extremely annoying that clients (e.g. Pywikibot) now need
to add a Wikimedia-specific tag to their user-agent. A user-agent
should be client-specific, not server-specific. Why not just "Bot",
or "MediaWikiBot", which at least encompasses all sites that the client
can communicate with?
--
John Vandenberg