The algorithm has been imperfect for a long time. How long and how
imperfect doesn't matter. Analytics is all about making good use of
imperfect algorithms to provide reasonable approximations.
However, I expect the role of Analytics is to improve the
definitions and implementation over time, not to force a bad algorithm
into policy.
I don't think it is a bad algorithm. Using 'bot' in the user-agent is a
widely adopted convention, so analytics code needs to implement it (even
if it is an approximation). Because of that, Wikimedia bots with the word
'bot' in their user-agents have been tagged as bots for a long time now.
And it seems to make sense to have a line in the user-agent policy that
refers to that fact.
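As a rough illustration, the convention amounts to a case-insensitive
substring match on the user-agent. This is a simplified sketch, not the
actual refinery code (the pattern and function name here are made up):

```python
import re

# Hypothetical simplified classifier; the real pageview-definition code in
# analytics-refinery-source is more involved. This only illustrates the
# 'bot' substring convention being discussed.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def is_bot_user_agent(user_agent: str) -> bool:
    """Tag a request as bot-originated if its user-agent matches the pattern."""
    return bool(BOT_PATTERN.search(user_agent))
```

Note that any product with 'bot' in its name (e.g. "Pywikibot/3.0") matches,
which is exactly the approximation being debated in this thread.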
It is no different from a web browser in how it *may* be used,
although of course typically the primary goal of using Pywikibot
instead of a Web browser is to reduce the amount of human consumption
and decision making needed to perform a task.
That is also Analytics' view on the subject. As you said, it is an
approximation that won't fit all cases. But in general, it makes sense to
make that approximation and tag such requests as non-human.
On Tue, Mar 22, 2016 at 5:18 AM, John Mark Vandenberg <jayvdb(a)gmail.com>
wrote:
On Tue, Mar 22, 2016 at 12:44 AM, Marcel Ruiz Forns
<mforns(a)wikimedia.org> wrote:
...
I think adding the word bot to the user-agent of bot-like programs is a
widely adopted convention. Actually, the word bot is already (for a long
time now) being parsed and used to tag requests as bot-originated in our
jobs that process requests into pageviews stats, because many external
bots
The algorithm has been imperfect for a long time. How long and how
imperfect doesn't matter. Analytics is all about making good use of
imperfect algorithms to provide reasonable approximations.
However, I expect the role of Analytics is to improve the
definitions and implementation over time, not to force a bad algorithm
into policy.
Pywiki*bot* has the string 'bot' in its user-agent, because it is part
of the product name.
However, not all usage of Pywikibot is a crawler or even a bot, in any
sensible definition of those concepts.
Pywikibot is a *user agent* that knows how to be a client of the
*MediaWiki API*. It can be used for "in-situ human consumption" or
not.
It is no different from a web browser in how it *may* be used,
although of course typically the primary goal of using Pywikibot
instead of a Web browser is to reduce the amount of human consumption
and decision making needed to perform a task. But that is no
different to Gadgets written using the JavaScript libraries that run
in the Web browser.
It can function *exactly* like a web browser: reading a special:search
results page, viewing some of those pages in the search results, and
making edits to some of them. Each page may be viewed by a real
human, who is making decisions throughout the entire process about
which pages to view and which pages to edit.
Or it can function *exactly* like a crawler, spider, bot, etc., with
zero human consumption.
Almost every script that is packaged with Pywikibot has an automatic
and non-automatic mode of operation.
Should we change our user-agent to "Pywikihuman" when in non-automatic
mode of operation, so that it isn't considered to be a bot by
Analytics?
Using the string 'bot' in the user-agent may have been a useful
approximation for Analytics circa 2010, but it is bad policy, and
Analytics can and should do much better than that in 2016 now that API
usage is in focus.
There is very little information at
https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
I can see) regarding what use of the API is considered to be a
**page** view. For example, is it a page view when I ask the API for
metadata only of the last revision of a page -- i.e. the page/revision
text is not included in the response?
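For concreteness, such a metadata-only request can be expressed with standard
MediaWiki API parameters: `prop=revisions` with an `rvprop` list that omits
`content`, so the revision text is not returned. A small sketch (the helper
name and default endpoint are illustrative, not part of any policy):

```python
from urllib.parse import urlencode

# Hypothetical helper: build a MediaWiki API query for the metadata of a
# page's latest revision, deliberately omitting 'content' from rvprop so
# the revision text is not included in the response.
def build_metadata_query(title, api="https://en.wikipedia.org/w/api.php"):
    params = urlencode({
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "ids|timestamp|user",  # metadata only, no page text
        "format": "json",
    })
    return api + "?" + params
```

Whether fetching that URL counts as a **page** view is exactly the
unanswered question.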
You're right, and this is a very good question. I fear the only ways to
look into this are browsing the actual code in:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
I am not very interested in the code, which is at best an attempt at
implementing the API page view definition. I'd like to understand the
high level goal.
However, having read that file and the accompanying test suite, it is
my understanding that there is no definition of an API page view.
I.e. all requests to api.php, except for api.php usage by the
Wikipedia App (i.e. with user-agent "WikipediaApp", used by the iOS
and Android Apps), are classified as *not a page view*.
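In other words, as I read it, the current behaviour boils down to something
like the following. This is a paraphrase in Python of my reading of the
refinery code, not the code itself, and the function name is made up:

```python
def is_api_pageview(path, user_agent):
    """My reading of current refinery behaviour: api.php requests count as
    page views only when they come from the official Wikipedia App
    (user-agent containing "WikipediaApp")."""
    if "api.php" in path:
        return "WikipediaApp" in user_agent
    return False  # non-api.php (index.php etc.) handling is out of scope here
```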
fwiw, rather than reading the source, this test data file with
expected results is a simpler way to see the current status.
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
or asking the Research team, who owns the
definition.
Could the Research team please publish their definition of API
(api.php) page views, like they do for Web (index.php) page views.
Without this, it is hard to have a serious conversation about how
changing the user-agent policy might be helpful to achieve the goal of
better classifying API page views.
--
John Vandenberg
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation