The algorithm has been imperfect for a long time. How long and how
imperfect doesn't matter. Analytics is all about making good use of
imperfect algorithms to provide reasonable approximations.
However, I expect the role of Analytics is to improve the
definitions and implementation over time, not to force a bad algorithm
into policy.
I don't think it is a bad algorithm. Using 'bot' in the user-agent is a
widely adopted convention, so analytics code needs to implement it (even
if it is an approximation). Because of that, Wikimedia bots with the word
'bot' in their user-agents have been tagged as bots for a long time now.
And it seems to make sense to have a line in the user-agent policy that
refers to that fact.
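As a rough illustration, the convention amounts to a case-insensitive
substring match on the user-agent. This is a simplified sketch, not the
actual refinery code (the pattern and function name here are made up):

```python
import re

# Hypothetical simplified classifier; the real pageview-definition code in
# analytics-refinery-source is more involved. This only illustrates the
# 'bot' substring convention being discussed.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def is_bot_user_agent(user_agent: str) -> bool:
    """Tag a request as bot-originated if its user-agent matches the pattern."""
    return bool(BOT_PATTERN.search(user_agent))
```

Note that any product with 'bot' in its name (e.g. "Pywikibot/3.0") matches,
which is exactly the approximation being debated in this thread.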
It is no different from a web browser in how it *may* be used,
although of course typically the primary goal of using Pywikibot
instead of a Web browser is to reduce the amount of human consumption
and decision making needed to perform a task.
That is also Analytics' view on the subject. As you said, it is an
approximation that won't fit all cases. But in general, it makes sense to
make that approximation and tag such requests as non-human.
On Tue, Mar 22, 2016 at 5:18 AM, John Mark Vandenberg <jayvdb(a)gmail.com>
wrote:
On Tue, Mar 22, 2016 at 12:44 AM, Marcel Ruiz Forns
<mforns(a)wikimedia.org> wrote:
...
I think adding the word bot to the user-agent of bot-like programs is a
widely adopted convention. Actually, the word bot is already (for a long
time now) being parsed and used to tag requests as bot-originated in our
jobs that process requests into pageviews stats, because many external
bots
The algorithm has been imperfect for a long time. How long and how
imperfect doesn't matter. Analytics is all about making good use of
imperfect algorithms to provide reasonable approximations.
However, I expect the role of Analytics is to improve the
definitions and implementation over time, not to force a bad algorithm
into policy.
Pywiki*bot* has the string 'bot' in its user-agent, because it is part
of the product name.
However, not all usage of Pywikibot is a crawler or even a bot, in any
sensible definition of those concepts.
Pywikibot is a *user agent* that knows how to be a client of the
*MediaWiki API*. It can be used for "in-situ human consumption" or
not.
It is no different from a web browser in how it *may* be used,
although of course typically the primary goal of using Pywikibot
instead of a Web browser is to reduce the amount of human consumption
and decision making needed to perform a task. But that is no
different to Gadgets written using the JavaScript libraries that run
in the Web browser.
It can function *exactly* like a web browser: reading a special:search
results page, viewing some of those pages in the search results, and
making edits to some of them. Each page may be viewed by a real
human, who is making decisions throughout the entire process about
which pages to view and which pages to edit.
Or it can function *exactly* like a crawler, spider, bot, etc., with
zero human consumption.
Almost every script that is packaged with Pywikibot has an automatic
and non-automatic mode of operation.
Should we change our user-agent to "Pywikihuman" when in non-automatic
mode of operation, so that it isn't considered to be a bot by
Analytics?
Using the string 'bot' in the user-agent may have been a useful
approximation for Analytics circa 2010, but it is bad policy, and
Analytics can and should do much better than that in 2016 now that API
usage is in focus.
There is very little information at
https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
I can see) regarding what use of the API is considered to be a
**page** view. For example, is it a page view when I ask the API for
metadata only of the last revision of a page -- i.e. the page/revision
text is not included in the response?
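For concreteness, such a metadata-only request can be expressed with standard
MediaWiki API parameters: `prop=revisions` with an `rvprop` list that omits
`content`, so the revision text is not returned. A small sketch (the helper
name and default endpoint are illustrative, not part of any policy):

```python
from urllib.parse import urlencode

# Hypothetical helper: build a MediaWiki API query for the metadata of a
# page's latest revision, deliberately omitting 'content' from rvprop so
# the revision text is not included in the response.
def build_metadata_query(title, api="https://en.wikipedia.org/w/api.php"):
    params = urlencode({
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "ids|timestamp|user",  # metadata only, no page text
        "format": "json",
    })
    return api + "?" + params
```

Whether fetching that URL counts as a **page** view is exactly the
unanswered question.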
You're right, and this is a very good question. I fear the only ways to
look into this are browsing the actual code in:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
I am not very interested in the code, which is at best an attempt at
implementing the API page view definition. I'd like to understand the
high level goal.
However, having read that file and the accompanying test suite, it is
my understanding that there is no definition of an API page view.
I.e. all requests to api.php, except for api.php usage by the
Wikipedia App (i.e. with user-agent "WikipediaApp", used by the iOS
and Android Apps), are classified as *not a page view*.
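In other words, as I read it, the current behaviour boils down to something
like the following. This is a paraphrase in Python of my reading of the
refinery code, not the code itself, and the function name is made up:

```python
def is_api_pageview(path, user_agent):
    """My reading of current refinery behaviour: api.php requests count as
    page views only when they come from the official Wikipedia App
    (user-agent containing "WikipediaApp")."""
    if "api.php" in path:
        return "WikipediaApp" in user_agent
    return False  # non-api.php (index.php etc.) handling is out of scope here
```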
fwiw, rather than reading the source, this test data file with
expected results is a simpler way to see the current status.
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
or asking the Research team, who owns the
definition.
Could the Research team please publish their definition of API
(api.php) page views, like they do for Web (index.php) page views.
Without this, it is hard to have a serious conversation about how
changing the user-agent policy might be helpful to achieve the goal of
better classifying API page views.
--
John Vandenberg
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation