I'm trying to figure out which of our two current pageview definitions to use for a question Bob and I are trying to address. It would be great if you could share your thoughts. If you do, please reply by end of day Tuesday, PST.
More details:
What are we doing?
We are building an edit recommendation system that identifies articles missing from a given Wikipedia language that have a corresponding page in at least one of the top 50 Wikipedia languages, ranks them, and recommends the ranked articles to editors whom the algorithm assesses as likely to be interested in editing them.
Where does pageview definition come into play?
When we rank the missing articles. To do the ranking, we want to consider the pageviews to the article in the languages in which it exists, and use this information to estimate the traffic we would expect in the language in which it is missing.
Why does it matter which pageview definition we use?
We would like to use the webstatscollector pageview definition, since the hourly data we have based on this definition goes back to roughly September 2014. If we go with the new pageview definition, we will only have data for the past 2.5 months. The longer the period we have data for, the better.
Why don't you then just use webstatscollector data?
We're inclined to do that, but we need to make sure that the data works for the kind of analysis we want to do. Per discussions with Oliver, webstatscollector data contains a lot of pageviews from bots and spiders. The question is: is the effect of bot/spider traffic, i.e., the number of pageviews it adds to each page, roughly uniform across all pages? If so, the relative ranking of articles should be largely unaffected, and the webstatscollector definition will be our choice.
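One way to probe the question above, as a minimal sketch rather than a settled method: for the period where both definitions have data, compare per-page counts under the two definitions and check whether the page ranking is preserved. The function names and the sample counts below are hypothetical, and the sketch assumes that the per-page difference between the two counts is a usable proxy for bot/spider traffic, which may not hold if the definitions differ in other ways.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def compare_definitions(old_counts, new_counts):
    """Compare per-page counts under the two definitions.

    old_counts / new_counts: dicts mapping page title -> pageviews
    (hypothetical inputs covering the same overlapping time window).
    Returns the rank correlation and the per-page excess (old - new).
    """
    pages = sorted(set(old_counts) & set(new_counts))
    old = [old_counts[p] for p in pages]
    new = [new_counts[p] for p in pages]
    excess = {p: old_counts[p] - new_counts[p] for p in pages}
    return spearman(old, new), excess


# Made-up counts, for illustration only.
old_counts = {"A": 1200, "B": 300, "C": 90}
new_counts = {"A": 1000, "B": 250, "C": 20}
rho, excess = compare_definitions(old_counts, new_counts)
```

A rank correlation close to 1 would suggest that the extra traffic does not reorder pages much, even if the per-page excess is not strictly uniform. Looking at the spread of the excess values would show how far the data is from the uniform case.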
I appreciate your thoughts on this.