A user-agent should be client specific, not server specific.
This makes a lot of sense. If I build a bot that crawls Wikipedia and public Facebook pages, it doesn't make sense for my bot to carry a "WikimediaBot" user agent. The word "Bot" alone should probably be enough.
On Wed, Jan 27, 2016 at 8:47 PM, John Mark Vandenberg jayvdb@gmail.com wrote:
On Thu, Jan 28, 2016 at 11:15 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Hi analytics list,
In the past months the WikimediaBot convention has been mentioned in a couple of threads, but we (the Analytics team) never finished establishing and advertising it. In this email we explain what the convention is today and what purpose it serves. We also ask for feedback to be sure we can continue with the next steps.
What is the WikimediaBot convention? It is a way of better identifying Wikimedia traffic originated by bots. Today we know that a significant share of Wikimedia traffic comes from bots. We can recognize a part of that traffic with regular expressions[1], but we cannot recognize all of it, because some bots do not identify themselves as such. If we could identify a greater part of the bot traffic, we could also better isolate the human traffic and permit more accurate analyses.
Who should follow the convention? Computer programs that access Wikimedia sites or the Wikimedia API for reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention? Computer programs that follow the on-site ad-hoc commands of a human, like browsers. Also, well-known spiders that are otherwise recognizable by their well-known user-agent strings.
How to follow the convention? The client's user-agent string should contain the word "WikimediaBot". The word can be anywhere within the user-agent string and is case-sensitive.
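For illustration, the rule above could be checked with a plain case-sensitive substring test. This is only a rough sketch: the function names, the example client name, and the contact address are made up for this example and are not part of the convention or of any existing tool.

```python
# The marker word from the convention. It may appear anywhere in the
# user-agent string and is matched case-sensitively.
CONVENTION_TOKEN = "WikimediaBot"


def follows_convention(user_agent: str) -> bool:
    """Return True if the user-agent string contains the marker word.

    Python's `in` operator on strings is case-sensitive, so
    "wikimediabot" does not count.
    """
    return CONVENTION_TOKEN in user_agent


def build_user_agent(client: str, version: str, contact: str) -> str:
    """Build a user-agent string that carries the marker word.

    The layout (client/version, contact in parentheses, then the marker)
    is an assumption made for this sketch; the convention only requires
    that the word is present somewhere in the string.
    """
    return f"{client}/{version} ({contact}) {CONVENTION_TOKEN}"


if __name__ == "__main__":
    # Hypothetical client name and contact address.
    ua = build_user_agent("ExampleStatsFetcher", "0.1", "ops@example.org")
    print(ua, follows_convention(ua))
```

A client would then send the resulting string as its User-Agent header on every automated request.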
This is useless unless someone is going to start blocking bots that don't follow it.
There is an existing policy, which is not being followed / enforced.
https://meta.wikimedia.org/wiki/User-Agent_policy
It is also extremely annoying that clients (e.g. Pywikibot) now need to add a Wikimedia-specific tag to their user-agent. A user-agent should be client specific, not server specific. Why not just "Bot", or "MediaWikiBot", which at least encompasses all sites that the client can communicate with?
-- John Vandenberg
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics