Hey,
I've been looking through the documentation on the pageview API over the
past few days, and have a question that I have not been able to resolve
so far.
As I understand it, the data accessible through the "aggregated by
project" pageview API [1], when filtered to "user" agents only, should
return the same results as the hourly pageview dumps [2, 3].
However, in two brief tests (for the data of October 1, 2015) the values
were close but did not match.
Data from "aggregate" API:
en.wikipedia & excluding spiders [4]: 238,845,634
pt.wikipedia & excluding spiders [5]: 11,390,043
Data from pageview dumps [3]:
en & en.zero & en.m: 238,840,836
pt & pt.zero & pt.m: 11,389,979
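
For anyone who wants to reproduce the dump-side totals, here is a minimal sketch of how such a sum could be computed. It assumes that each line of an hourly dump file has the whitespace-separated form "domain_code page_title view_count response_size". The function name sum_views and the sample lines are made up for illustration and are not real dump data.

```python
from typing import Iterable


def sum_views(lines: Iterable[str], domain_codes: set[str]) -> int:
    """Sum the view counts of all lines whose domain code is in domain_codes.

    Assumed line format: "domain_code page_title view_count response_size".
    """
    total = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        domain, count = parts[0], parts[2]
        if domain in domain_codes and count.isdigit():
            total += int(count)
    return total


# Made-up sample lines, only to show the shape of the input.
sample = [
    "en Main_Page 10 0",
    "en.m Main_Page 5 0",
    "en.zero Main_Page 1 0",
    "pt Example_page 7 0",
]

print(sum_views(sample, {"en", "en.m", "en.zero"}))  # 16
```

For a real hourly file, the lines could come from gzip.open(path, "rt", encoding="utf-8"), and the daily total would be the sum over all 24 hourly files.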
As you can see, while the values are close, they do not match.
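
To put a number on the gap, the short sketch below subtracts the dump totals from the API totals quoted above. The dictionary names are only illustrative.

```python
# Totals for October 1, 2015, as quoted above.
api_totals = {"en": 238_845_634, "pt": 11_390_043}
dump_totals = {"en": 238_840_836, "pt": 11_389_979}

# Absolute and relative difference (API minus dumps) per wiki.
differences = {}
for wiki, api_value in api_totals.items():
    diff = api_value - dump_totals[wiki]
    differences[wiki] = diff
    print(f"{wiki}: API is higher by {diff} views ({diff / api_value:.4%})")
```

The API reports 4798 more views for en and 64 more for pt, so in both cases the API value is slightly higher than the dump value.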
What am I missing here? Am I perhaps mistaken in assuming that both are
derived from the same underlying data and should therefore agree?
Felix
[1] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
[3] https://dumps.wikimedia.org/other/pageviews/
[4] https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/…
[5] https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/pt.wikipedia/…