Hi analytics list,
In the past months, the WikimediaBot convention has been mentioned in a couple of threads, but we (the Analytics team) never finished establishing and advertising it. In this email we explain what the convention is today and what purpose it serves, and we also ask for feedback to make sure we can continue with the next steps.
What is the WikimediaBot convention? It is a way of better identifying Wikimedia traffic that originates from bots. Today we know that a significant share of Wikimedia traffic comes from bots. We can recognize part of that traffic with regular expressions[1], but we cannot recognize all of it, because some bots do not identify themselves as such. If we could identify a greater part of the bot traffic, we could also better isolate the human traffic and permit more accurate analyses.
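For illustration, here is a minimal Python sketch of this kind of user-agent matching (the pattern below is just an example, not the actual expression used in refinery[1]):

    import re

    # Illustrative pattern only; the real list in refinery[1] is much longer
    # and more precise. It catches common self-identifying bot markers.
    BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

    def looks_like_bot(user_agent):
        """Return True if the user-agent string matches known bot markers."""
        return bool(BOT_PATTERN.search(user_agent))

    print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
    print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0) Firefox/45"))  # False

Bots whose user-agent contains none of these markers slip through, which is exactly the gap the convention tries to close.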
Who should follow the convention? Computer programs that access Wikimedia sites or the Wikimedia API for reading purposes* in a periodic, scheduled, or automatically triggered way.
Who should NOT follow the convention? Computer programs that carry out ad-hoc, on-site commands from a human, such as browsers; and well-known spiders that are already recognizable by their user-agent strings.
How to follow the convention? The client's user-agent string should contain the word "WikimediaBot". The word can appear anywhere within the user-agent string, and the match is case-sensitive.
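For example, a minimal Python sketch using the requests library (the tool name, contact URL, and query here are placeholders; only the "WikimediaBot" token itself is part of the convention):

    import requests

    # The convention only requires that the user-agent contain the
    # case-sensitive word "WikimediaBot"; the rest of the string is
    # free-form. Including tool name and contact info is good practice.
    headers = {
        "User-Agent": "MyTool/1.0 (https://example.org/mytool; "
                      "mytool@example.org) WikimediaBot"
    }

    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": "Main Page", "format": "json"},
        headers=headers,
    )
    print(response.json())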
So please feel free to post your comments/feedback on this thread. In the course of this discussion we can adjust the convention's definition, and if no major concerns are raised, in two weeks we'll create a documentation page on Wikitech, send an email to the appropriate mailing lists, and maybe write a blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia content.
[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-...
[2] https://www.mediawiki.org/wiki/Manual:Bots