I should have started this discussion a while ago, but it's easier to catch
up on work during vacation :)
We currently publish three static file dumps of pageview data. I will
describe them here and share my thoughts on simplifying the situation.
Feel free to turn this thread into a wiki.
* PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>. We
have this data going back to 2007. It uses a very simple pageview
definition that incorrectly counts things like banner views as pageviews.
* PAGECOUNTS-ALL-SITES
<http://dumps.wikimedia.org/other/pagecounts-all-sites/>. We have this
data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset adds
traffic from the mobile versions of our sites, but it still uses the same
simple pageview definition.
* PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>. We have this
data starting in May 2015. It implements the new and much improved
pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
that we now use. This is the same pageview definition used in the pageview
API. This dataset also removes spider traffic and any automata traffic
that we can detect.
All three datasets are in the same format (Domas's archive format).
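
For anyone who hasn't worked with these files: as far as I know, each line
holds four space-separated fields (project code, page title, request count,
bytes returned). Here is a minimal parsing sketch under that assumption; the
names PageCount and parse_line are made up for illustration and are not part
of any existing tool.

```python
from typing import NamedTuple


class PageCount(NamedTuple):
    project: str  # project code, e.g. "en"
    title: str    # page title as it appears in the dump
    views: int    # number of requests counted for the hour
    size: int     # bytes returned (assumed to be the fourth field)


def parse_line(line: str) -> PageCount:
    """Parse one line of an hourly dump file (assumed 4-field layout)."""
    # Take the project code from the left and the two numeric fields from
    # the right, so a title containing a stray space does not shift them.
    project, rest = line.rstrip("\n").split(" ", 1)
    title, views, size = rest.rsplit(" ", 2)
    return PageCount(project, title, int(views), int(size))
```

Because the three dumps share this layout, the same parser should work on
files from any of them.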
So, before we can simplify this confusing situation, we need your help and
input about what to keep and how to keep it. Here's the approach I would
take:
Combine pagecounts-raw with pagecounts-all-sites into a new dataset called
"pagecounts". Keep producing data to this dataset forever, but remove
"pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new
data with historical data going back as far as we need. We would explain on
dumps.wikimedia.org/other that this dataset gains mobile data starting
in October 2014, to explain the relative local spike that happens there.
This dataset would remain a pretty bad estimate of actual page views, and
would remain sensitive to automata and spider spikes. But in combination
with the "pageviews" dataset, I think it would be useful.
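
To make the merge concrete, here is a rough sketch of how the combined
"pagecounts" dataset could choose its source for a given day. The exact
cutover date is an assumption (I only know mobile data starts in October
2014), and source_dataset is a hypothetical helper, not existing code.

```python
from datetime import date

# Assumed cutover; only the month (October 2014) is known for sure.
MOBILE_CUTOVER = date(2014, 10, 1)


def source_dataset(day: date, cutover: date = MOBILE_CUTOVER) -> str:
    """Return which existing dump would feed "pagecounts" for this day."""
    # Before the cutover only desktop counts exist; from the cutover on,
    # use the dump that also includes mobile traffic.
    if day < cutover:
        return "pagecounts-raw"
    return "pagecounts-all-sites"
```

Passing a different cutover lets us adjust the boundary once we confirm the
first day that pagecounts-all-sites actually covers.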
What do you all think? Sound off in this thread, and if we have consensus
I'll start the cleanup.