On Tue, Mar 22, 2016 at 12:44 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
... I think adding the word bot to the user-agent of bot-like programs is a widely adopted convention. Actually, the word bot has already (for a long time now) been parsed and used to tag requests as bot-originated in our jobs that process requests into pageview stats, because many external bots include it in their user-agent. See: http://www.useragentstring.com/pages/Crawlerlist/
The algorithm has been imperfect for a long time. How long and how imperfect doesn't matter. Analytics is all about making good use of imperfect algorithms to provide reasonable approximations.
However, I expect the role of Analytics is to improve the definitions and implementations over time, not to force a bad algorithm into policy.
Pywiki*bot* has the string 'bot' in its user-agent because it is part of the product name. However, not every use of Pywikibot is a crawler or even a bot, by any sensible definition of those concepts. Pywikibot is a *user agent* that knows how to be a client of the *MediaWiki API*. It can be used for "in-situ human consumption" or not.
It is no different from a web browser in how it *may* be used, although of course typically the primary goal of using Pywikibot instead of a Web browser is to reduce the amount of human consumption and decision making needed to perform a task. But that is no different from Gadgets written using the JavaScript libraries that run in the Web browser.
It can function *exactly* like a web browser: reading a Special:Search results page, viewing some of the pages in the search results, and making edits to some of them. Each page may be viewed by a real human, who is making decisions throughout the entire process about which pages to view and which pages to edit.
Or it can function *exactly* like a crawler, spider, bot, etc., with zero human consumption.
Almost every script that is packaged with Pywikibot has both automatic and non-automatic modes of operation. Should we change our user-agent to "Pywikihuman" when in non-automatic mode of operation, so that it isn't considered to be a bot by Analytics?
Using the string 'bot' in the user-agent may have been a useful approximation for Analytics circa 2010, but it is bad policy, and Analytics can and should do much better than that in 2016, now that API usage is in focus.
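To make the objection concrete, here is a minimal sketch of substring-based classification in the spirit of what Marcel describes (the production rule may well differ); the user-agent strings below are hypothetical examples, not the exact strings Pywikibot sends:

```python
import re

# Naive rule: any user-agent containing 'bot' is tagged as bot traffic.
# This is an illustrative sketch, not the actual Analytics implementation.
BOT_RE = re.compile(r'bot', re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the user-agent matches the naive 'bot' substring rule."""
    return bool(BOT_RE.search(user_agent))

# Pywikibot matches because 'bot' is part of the product name,
# even when a human is driving the tool interactively:
print(looks_like_bot("Pywikibot/3.0 (User:Example)"))     # True
print(looks_like_bot("Mozilla/5.0 (X11; Linux x86_64)"))  # False
```

The point is that the rule keys on the product name, not on the mode of operation, so interactive Pywikibot sessions are misclassified.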
There is very little information at
https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that I can see) regarding what use of the API is considered to be a **page** view. For example, is it a page view when I ask the API for metadata only of the last revision of a page -- i.e. the page/revision text is not included in the response?
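For clarity, the kind of request I mean can be built like this; it asks `action=query` with `prop=revisions` for revision metadata only, with no `content` in `rvprop` (the page title here is a placeholder):

```python
from urllib.parse import urlencode

# Metadata-only revision query: rvprop requests ids and timestamp,
# deliberately omitting 'content', so no page text is returned.
params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "ids|timestamp",
    "titles": "Example",  # placeholder title
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```

Should a request like this count as a page view, or not? The current documentation does not say.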
You're right, and this is a very good question. I fear the only ways to look into this are browsing the actual code in: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-...
I am not very interested in the code, which is at best an attempt at implementing the API page view definition. I'd like to understand the high level goal.
However, having read that file and the accompanying test suite, it is my understanding that there is no definition of an API page view. i.e. all requests to api.php, except api.php usage by the Wikipedia App (i.e. with user-agent "WikipediaApp", used by the iOS and Android Apps), are classified as *not a page view*.
FWIW, rather than reading the source, this test data file with expected results is a simpler way to see the current status:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-...
or asking the Research team, who owns the definition.
Could the Research team please publish their definition of API (api.php) page views, like they do for Web (index.php) page views?
Without this, it is hard to have a serious conversation about how changing the user-agent policy might be helpful to achieve the goal of better classifying API page views.
-- John Vandenberg