Hi analytics list,
In the past few months the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves, and we also ask for feedback to make sure we
can continue with the next steps.
What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic that originates from bots.
Today we know that a significant share of Wikimedia traffic comes from
bots. We can recognize part of that traffic with regular expressions[1],
but we cannot recognize all of it, because some bots do not identify
themselves as such. If we could identify a greater part of the bot traffic,
we could also better isolate the human traffic and permit more accurate
analyses.
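To illustrate the idea, here is a minimal sketch of the kind of regular-expression check we run on user-agent strings. The pattern below is a simplified, hypothetical one; the real pattern in refinery-source[1] is considerably longer and more nuanced.

```python
import re

# Simplified, illustrative pattern: the real expression in
# analytics-refinery-source covers many more known bot signatures.
BOT_PATTERN = re.compile(r"(bot|crawler|spider)", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the user-agent string matches a known bot signature."""
    return bool(BOT_PATTERN.search(user_agent))
```

A bot that does not include any such signature in its user-agent string slips through this check, which is exactly the gap the convention aims to close.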
Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention?
Computer programs that act on the ad-hoc, on-site commands of a human, like
browsers; and well-known spiders that are otherwise recognizable by their
user-agent strings.
How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can appear anywhere within the user-agent string, and the match is
case-sensitive.
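For example, a scheduled script using Python's standard library might set its user-agent like this. The tool name and contact address are hypothetical; the convention only requires that the case-sensitive string "WikimediaBot" appears somewhere in the header.

```python
from urllib.request import Request

# Hypothetical tool name and contact address; only the case-sensitive
# substring "WikimediaBot" is required by the convention.
UA = "MyReportScript/1.0 (WikimediaBot; contact@example.org)"

req = Request(
    "https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json",
    headers={"User-Agent": UA},
)
# The request object now carries the header; no network call is made here.
```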
So, please feel free to post your comments/feedback on this thread. In the
course of this discussion we can adjust the convention's definition and, if
no major concerns are raised, in 2 weeks we'll create a documentation page
on Wikitech, send an email to the proper mailing lists, and maybe write a
blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia
content.
[1]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
[2]
https://www.mediawiki.org/wiki/Manual:Bots
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation