>What I would recommend is using the new data in wmf.webrequests, which
>gives you, as you say, about 2.5 months, and filtering the user agent;
>there are a couple of UDFs for user agent detection, including
>isSpider, which also looks for wikimedia-specific bots that ua-parser
>ignores.

Just so you know, adding UA/spider parsing to the refined tables is on our backlog of immediate tasks. Until then, the data in the refined tables is unparsed (UA-wise), but by using the UDFs that Oliver suggested you can still benefit from the new definition.



On Sun, Mar 15, 2015 at 1:05 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
It's not roughly uniform - it varies widely. One of the things I
identified in my experimentation with methods for detecting automata
is that a lot of "bad-faith" automated traffic - the stuff that is
hard to detect even with user agent identification - hits specific
pages lots and lots of times, not every page once (although there are
some bots that do that). With the WSC data, which is both non-granular
and unfiltered, you're going to have problems.

What I would recommend is using the new data in wmf.webrequests, which
gives you, as you say, about 2.5 months, and filtering the user agent;
there are a couple of UDFs for user agent detection, including
isSpider, which also looks for wikimedia-specific bots that ua-parser
ignores. There are additional measures and heuristics for identifying
traffic that is the result of unbalanced automata, which I'm happy to
talk through with you (a mix of burst detection, heuristics around the
proportion of traffic to each site version, and concentration
measures). The burst detection element, at least, should also be
applicable to the WSC data, so if you find a need for a longer
timeframe you could always use WSC data but investigate applying that
- there are some good frameworks out there for doing so.
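
To make the burst detection and concentration ideas above a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the method actually used on the cluster: the function names, the 60-second window, and the input shapes (a list of page titles and a list of timestamps in seconds for a single client) are all hypothetical.

```python
from collections import Counter


def concentration(pages):
    """Herfindahl-style index of one client's requests across pages.

    Returns 1.0 when every request hit the same page and 1/n when the
    requests are spread evenly over n distinct pages.
    """
    counts = Counter(pages)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


def max_burst(timestamps, window=60):
    """Largest number of requests inside any span shorter than `window` seconds."""
    ts = sorted(timestamps)
    best = start = 0
    for end, t in enumerate(ts):
        # Slide the left edge forward until the span is shorter than the window.
        while t - ts[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

A client with a concentration close to 1.0 and a large `max_burst` would fit the "hits specific pages lots and lots of times" pattern; any threshold for flagging such a client would have to be chosen empirically.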

On 15 March 2015 at 14:47, Leila Zia <leila@wikimedia.org> wrote:
> Hi,
>
>    I'm trying to figure out which of the two pageview definitions we
> currently have I can use for a question Bob and I are trying to address. It
> would be great if you could share your thoughts. If you choose to do so,
> please do it by Tuesday, eod, PST.
>
> More details:
>
> What are we doing?
> We are building an edit recommendation system that identifies articles
> missing from a given Wikipedia language that have a corresponding page in
> at least one of the top 50 Wikipedia languages, ranks them, and recommends
> the ranked articles to editors whom the algorithm assesses as likely to
> want to edit them.
>
> Where does pageview definition come into play?
> When we want to rank missing articles. To do the ranking, we want to
> consider the pageviews to the article in the languages the article exists
> in, and using this information estimate what the traffic is expected to be
> in the language the article is missing in.
>
> Why does it matter which pageview definition we use?
> We would like to use webstatscollector pageview definition since the hourly
> data we have based on this definition goes back to roughly September 2014.
> If we go with the new pageview definition, we will have data for the past
> 2.5 months. The longer period of time we have data for, the better.
>
> Why don't you then just use webstatscollector data?
> We're inclined to do that but we need to make sure that data works for the
> kind of analysis we want to do. Per discussions with Oliver,
> webstatscollector data has a lot of pageviews from bots and spiders. The
> question is: is the effect of bot/spider traffic, i.e., the number of
> pageviews they add to each page, roughly uniform across all pages? If that
> is the case, webstatscollector definition will be our choice.
>
> I appreciate your thoughts on this.
>
> Best,
> Leila
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



--
Oliver Keyes
Research Analyst
Wikimedia Foundation
