What I would recommend is using the new data in wmf.webrequests, which gives you, as you say, about 2.5 months, and filtering the user agent; there are a couple of UDFs for user agent detection, including isSpider, which also looks for wikimedia-specific bots that ua-parser ignores.
Just so you know, adding UA/spider parsing to the refined tables is on our backlog of immediate tasks. Until then the data in the refined tables is unparsed (UA-wise), but using the UDFs that Oliver suggested you can still benefit from the new definition.
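If it helps, here is roughly the kind of filtering those UDFs let you do, sketched in Python over a flat export rather than in Hive; the regex and the 'user_agent' field name below are made up for illustration and are not the actual UDF logic:

    import csv
    import re

    # Crude stand-in for spider detection: match common crawler keywords plus
    # a couple of Wikimedia-specific bot markers. The real UDFs are more
    # thorough; this pattern is illustrative only.
    SPIDER_PATTERN = re.compile(
        r"bot|crawler|spider|pywikibot|wikimedia|http://",
        re.IGNORECASE,
    )

    def looks_like_spider(user_agent):
        """Return True if the user agent string looks automated."""
        return bool(user_agent) and bool(SPIDER_PATTERN.search(user_agent))

    def human_rows(path):
        """Yield rows from a TSV export of request data whose user agent does
        not look automated. Assumes a 'user_agent' column; adjust to your
        export."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if not looks_like_spider(row.get("user_agent", "")):
                    yield row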
On Sun, Mar 15, 2015 at 1:05 PM, Oliver Keyes okeyes@wikimedia.org wrote:
It's not roughly uniform - it varies widely. One of the things I identified in my experimentation with methods for detecting automata is that a lot of "bad-faith" automated traffic - the stuff that is hard to detect even with user agent identification - hits specific pages lots and lots of times, rather than every page once (although there are some bots that do that). With the WSC data, which is both non-granular and contains no filtering, you're going to have problems.
What I would recommend is using the new data in wmf.webrequests, which gives you, as you say, about 2.5 months, and filtering the user agent; there are a couple of UDFs for user agent detection, including isSpider, which also looks for wikimedia-specific bots that ua-parser ignores. There are additional measures and heuristics for identifying traffic that is the result of unbalanced automata, which I'm happy to talk through with you (a mix of burst detection, heuristics around the proportion of traffic to each site version, and concentration measures). The burst detection element, at least, should also be applicable to the WSC data, so if you find a need for a longer timeframe you could always use WSC data but investigate applying that - there are some good frameworks out there for doing so.
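To give a flavour of what I mean by burst detection and concentration measures, here's a toy Python sketch over hourly per-page counts; the top_k and 0.8 threshold are arbitrary numbers for illustration, not the values I'd actually use:

    from collections import defaultdict

    def burst_share(hourly_views, top_k=3):
        """Fraction of a page's views that fall in its top_k busiest hours.
        Values near 1.0 mean very spiky traffic; values near
        top_k / number_of_hours mean roughly even traffic."""
        total = sum(hourly_views)
        if total == 0:
            return 0.0
        return sum(sorted(hourly_views, reverse=True)[:top_k]) / total

    def flag_bursty_pages(rows, top_k=3, threshold=0.8):
        """rows: iterable of (page, hour, views) tuples. Returns the set of
        pages whose traffic is concentrated in a few hours; the 0.8 cutoff is
        arbitrary, purely for illustration."""
        per_page = defaultdict(list)
        for page, _hour, views in rows:
            per_page[page].append(views)
        return {page for page, counts in per_page.items()
                if len(counts) > top_k and burst_share(counts, top_k) > threshold}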
On 15 March 2015 at 14:47, Leila Zia leila@wikimedia.org wrote:
Hi,
I'm trying to figure out which of the two pageview definitions we currently have I can use for a question Bob and I are trying to address. It would be great if you could share your thoughts. If you choose to do so, please do it by Tuesday, EOD PST.
More details:
What are we doing? We are building an edit recommendation system that identifies missing articles in Wikipedia that have a corresponding page in at least one of the top 50 Wikipedia languages, ranks them, and recommends the ranked articles to editors whom the algorithm assesses as likely to want to edit them.
Where does the pageview definition come into play? When we want to rank missing articles. To do the ranking, we want to consider the pageviews to the article in the languages it exists in, and use this information to estimate the traffic it is expected to receive in the language it is missing in.
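In rough terms (a Python sketch with made-up inputs, not a settled design), the estimate we have in mind is something like:

    from statistics import median

    def estimate_views_in_missing_lang(article_views, lang_totals, target_lang):
        """Sketch of the estimate described above; the field names and the use
        of the median are illustrative, not a settled design.

        article_views: {lang: views of this article in that language}
        lang_totals:   {lang: total views of that language's Wikipedia}
        target_lang:   language edition the article is missing in
        """
        # Normalize by the size of each source wiki, then project the typical
        # per-wiki rate onto the target wiki.
        rates = [views / lang_totals[lang]
                 for lang, views in article_views.items()
                 if lang_totals.get(lang)]
        if not rates:
            return 0.0
        return median(rates) * lang_totals.get(target_lang, 0)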
Why does it matter which pageview definition we use? We would like to use the webstatscollector pageview definition, since the hourly data we have based on this definition goes back to roughly September. If we go with the new pageview definition, we will have data for the past 2.5 months. The longer the period of time we have data for, the better.
Why don't you then just use webstatscollector data? We're inclined to do that, but we need to make sure that data works for the kind of analysis we want to do. Per discussions with Oliver, webstatscollector data has a lot of pageviews from bots and spiders. The question is: is the effect of bot/spider traffic, i.e., the number of pageviews they add to each page, roughly uniform across all pages? If that is the case, the webstatscollector definition will be our choice.
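If we had spider-filtered counts for even a sample of pages, a crude check of that assumption could look like this (a Python sketch; the inputs and the 0.2 cutoff are illustrative, not something we have settled on):

    from statistics import mean, pstdev

    def inflation_ratios(raw_views, filtered_views):
        """Per-page ratio of raw (webstatscollector-style) views to
        spider-filtered views, for pages present in both dicts."""
        return [raw_views[page] / filtered_views[page]
                for page in raw_views
                if filtered_views.get(page)]

    def roughly_uniform(raw_views, filtered_views, max_cv=0.2):
        """Crude check: if the coefficient of variation of the inflation ratio
        is small, bot/spider traffic inflates pages roughly uniformly. The 0.2
        cutoff is arbitrary, for illustration only."""
        ratios = inflation_ratios(raw_views, filtered_views)
        if len(ratios) < 2:
            return False
        m = mean(ratios)
        return m > 0 and pstdev(ratios) / m <= max_cv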
I appreciate your thoughts on this.
Best,
Leila
--
Oliver Keyes
Research Analyst
Wikimedia Foundation