I should have started this discussion a while ago, but it's easier to catch up on work during vacation :)

We currently have three static file dumps of pageview data available.  I'll describe them here and share my thoughts on simplifying the situation.  Feel free to turn this thread into a wiki.

* PAGECOUNTS-RAW.  We have this data going back to 2007.  It uses a very simple pageview definition that incorrectly counts things like banner views as pageviews.
* PAGECOUNTS-ALL-SITES.  We have this data starting in late 2014.  Compared to PAGECOUNTS-RAW, this dataset adds traffic from the mobile versions of our sites, but it still uses the same simple pageview definition.
* PAGEVIEWS.  We have this data starting in May 2015.  It implements the new, much-improved pageview definition that we now use, the same one behind the pageview API.  This dataset also filters out spider traffic and any automata traffic we can detect.

All three datasets are in the same format (Domas's archive format).
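
For reference, each line in that format holds four space-separated fields: project, page title, view count, and total bytes transferred.  Below is a minimal Python sketch of reading one gzipped hourly file; the function name and the skip-malformed-lines behavior are my own assumptions for illustration, not anything in our existing tooling.

    import gzip

    def read_hourly_counts(path):
        """Yield (project, page_title, views) from one gzipped hourly file."""
        with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split(" ")
                if len(fields) != 4:
                    continue  # skip malformed lines
                project, title, views, _total_bytes = fields
                yield project, title, int(views)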

So, before we can simplify this confusing situation, we need your help and input about what to keep and how to keep it.  Here's the approach I would take:

Combine pagecounts-raw with pagecounts-all-sites into a new dataset called "pagecounts" (a rough sketch of the merge follows below).  Keep producing data into this dataset indefinitely, and remove "pagecounts-raw" and "pagecounts-all-sites".  This way, we can compare new data with historical data going back as far as we need.  On dumps.wikimedia.org/other we would note that this dataset gains mobile data starting in October 2014, which accounts for the apparent spike in traffic at that point.  This dataset would remain a pretty bad estimate of actual pageviews and would stay sensitive to automata and spider spikes.  But in combination with the "pageviews" dataset, I think it would be useful.
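
To make the merge concrete, here's a rough Python sketch of the idea: for each hour, take the pagecounts-all-sites file if it exists, and fall back to pagecounts-raw otherwise.  The directory names and file naming below are assumptions for illustration, not our actual layout.

    import os
    import shutil

    # Hypothetical local directory names, not the real dumps layout.
    RAW = "pagecounts-raw"
    ALL_SITES = "pagecounts-all-sites"
    MERGED = "pagecounts"

    def merge_hour(filename):
        """Copy one hourly file into the merged dataset, preferring
        the all-sites version when both datasets have it."""
        os.makedirs(MERGED, exist_ok=True)
        for source in (ALL_SITES, RAW):
            src = os.path.join(source, filename)
            if os.path.exists(src):
                shutil.copy(src, os.path.join(MERGED, filename))
                return source
        return None  # this hour is missing from both datasets

    # Usage: merge_hour("pagecounts-20141001-000000.gz")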

What do you all think?  Sound off in this thread, and if we have consensus I'll start the cleanup.