What I would recommend is using the new data in wmf.webrequests, which gives you, as you say, about 2.5 months, and filtering the user agent; there are a couple of UDFs for user agent detection, including isSpider, which also looks for wikimedia-specific bots that ua-parser ignores.
Just so you know, adding UA/spider parsing to the refined tables is on our backlog of immediate tasks. Until then the data in the refined tables is unparsed (UA-wise), but using the UDFs that Oliver suggested you can still benefit from the new definition.
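If it helps, here is roughly the kind of filtering those UDFs let you do, sketched in Python over a flat export rather than in Hive; the regex and the 'user_agent' field name below are made up for illustration and are not the actual UDF logic:

    import csv
    import re

    # Crude stand-in for spider detection: match common crawler keywords plus
    # a couple of Wikimedia-specific bot markers. The real UDFs are more
    # thorough; this pattern is illustrative only.
    SPIDER_PATTERN = re.compile(
        r"bot|crawler|spider|pywikibot|wikimedia|http://",
        re.IGNORECASE,
    )

    def looks_like_spider(user_agent):
        """Return True if the user agent string looks automated."""
        return bool(user_agent) and bool(SPIDER_PATTERN.search(user_agent))

    def human_rows(path):
        """Yield rows from a TSV export of request data whose user agent does
        not look automated. Assumes a 'user_agent' column; adjust to your
        export."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if not looks_like_spider(row.get("user_agent", "")):
                    yield row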
On Sun, Mar 15, 2015 at 1:05 PM, Oliver Keyes okeyes@wikimedia.org wrote:
It's not roughly uniform - it varies widely. One of the things I identified in my experimentation with methods for detecting automata is that a lot of "bad-faith" automated traffic - the stuff that is hard to detect even with user agent identification - hits specific pages lots and lots of times, rather than every page once (although there are some bots that do that). With the WSC data, which is both non-granular and contains no filtering, you're going to have problems.
What I would recommend is using the new data in wmf.webrequests, which gives you, as you say, about 2.5 months, and filtering the user agent; there are a couple of UDFs for user agent detection, including isSpider, which also looks for wikimedia-specific bots that ua-parser ignores. There are additional measures and heuristics for identifying traffic that is the result of unbalanced automata, which I'm happy to talk through with you (a mix of burst detection, heuristics around the proportion of traffic to each site version, and concentration measures). The burst detection element, at least, should also be applicable to the WSC data, so if you find a need for a longer timeframe you could always use WSC data but investigate applying that - there are some good frameworks out there for doing so.
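To give a flavour of what I mean by burst detection and concentration measures, here's a toy Python sketch over hourly per-page counts; the top_k and 0.8 threshold are arbitrary numbers for illustration, not the values I'd actually use:

    from collections import defaultdict

    def burst_share(hourly_views, top_k=3):
        """Fraction of a page's views that fall in its top_k busiest hours.
        Values near 1.0 mean very spiky traffic; values near
        top_k / number_of_hours mean roughly even traffic."""
        total = sum(hourly_views)
        if total == 0:
            return 0.0
        return sum(sorted(hourly_views, reverse=True)[:top_k]) / total

    def flag_bursty_pages(rows, top_k=3, threshold=0.8):
        """rows: iterable of (page, hour, views) tuples. Returns the set of
        pages whose traffic is concentrated in a few hours; the 0.8 cutoff is
        arbitrary, purely for illustration."""
        per_page = defaultdict(list)
        for page, _hour, views in rows:
            per_page[page].append(views)
        return {page for page, counts in per_page.items()
                if len(counts) > top_k and burst_share(counts, top_k) > threshold}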
On 15 March 2015 at 14:47, Leila Zia leila@wikimedia.org wrote:
Hi,
I'm trying to figure out which of the two pageview definitions we currently have I can use for a question Bob and I are trying to address. It would be great if you could share your thoughts. If you choose to do so, please do it by Tuesday, EOD PST.
More details:
What are we doing? We are building an edit recommendation system that identifies missing articles in Wikipedia that have a corresponding page in at least one of the top 50 Wikipedia languages, ranks them, and recommends the ranked articles to editors whom the algorithm assesses as likely to want to edit them.
Where does the pageview definition come into play? When we want to rank missing articles. To do the ranking, we want to consider the pageviews to the article in the languages it exists in, and use this information to estimate the traffic it is expected to receive in the language it is missing in.
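In rough terms (a Python sketch with made-up inputs, not a settled design), the estimate we have in mind is something like:

    from statistics import median

    def estimate_views_in_missing_lang(article_views, lang_totals, target_lang):
        """Sketch of the estimate described above; the field names and the use
        of the median are illustrative, not a settled design.

        article_views: {lang: views of this article in that language}
        lang_totals:   {lang: total views of that language's Wikipedia}
        target_lang:   language edition the article is missing in
        """
        # Normalize by the size of each source wiki, then project the typical
        # per-wiki rate onto the target wiki.
        rates = [views / lang_totals[lang]
                 for lang, views in article_views.items()
                 if lang_totals.get(lang)]
        if not rates:
            return 0.0
        return median(rates) * lang_totals.get(target_lang, 0)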
Why does it matter which pageview definition we use? We would like to use the webstatscollector pageview definition, since the hourly data we have based on this definition goes back to roughly September. If we go with the new pageview definition, we will have data for the past 2.5 months. The longer the period of time we have data for, the better.
Why don't you then just use webstatscollector data? We're inclined to do that, but we need to make sure that data works for the kind of analysis we want to do. Per discussions with Oliver, webstatscollector data has a lot of pageviews from bots and spiders. The question is: is the effect of bot/spider traffic, i.e., the number of pageviews they add to each page, roughly uniform across all pages? If that is the case, the webstatscollector definition will be our choice.
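If we had spider-filtered counts for even a sample of pages, a crude check of that assumption could look like this (a Python sketch; the inputs and the 0.2 cutoff are illustrative, not something we have settled on):

    from statistics import mean, pstdev

    def inflation_ratios(raw_views, filtered_views):
        """Per-page ratio of raw (webstatscollector-style) views to
        spider-filtered views, for pages present in both dicts."""
        return [raw_views[page] / filtered_views[page]
                for page in raw_views
                if filtered_views.get(page)]

    def roughly_uniform(raw_views, filtered_views, max_cv=0.2):
        """Crude check: if the coefficient of variation of the inflation ratio
        is small, bot/spider traffic inflates pages roughly uniformly. The 0.2
        cutoff is arbitrary, for illustration only."""
        ratios = inflation_ratios(raw_views, filtered_views)
        if len(ratios) < 2:
            return False
        m = mean(ratios)
        return m > 0 and pstdev(ratios) / m <= max_cv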
I appreciate your thoughts on this.
Best,
Leila
--
Oliver Keyes
Research Analyst
Wikimedia Foundation