Hi Oliver,
On Thu, Mar 12, 2015 at 07:00:07PM -0400, Oliver Keyes wrote:
> And now the UDF, the hive query, and the monthly aggregate of the hive query, all disagree with http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm . All of the aforementioned sources come up with 24bn, not 20.38.
(Assuming "hive query" means the Hive implementation of the webstatscollector pageview definition.)
I challenge that you're comparing apples to apples.
Taking a quick (and known to not necessarily be fully exact) shot at reproducing numbers, I can basically verify Erik's numbers [1][2] from the above URL.
So in contrast to what you claim, Erik's reports and the Hive-implementation of webstatscollector agree.
Since you said “monthly aggregates”, but referenced a “normalized” report of Erik's, I somewhat get the feeling you're comparing apples to oranges.
So (I know I am sounding like a broken record), please, Oliver, instead of claiming things without giving us a chance to reproduce them, show how you ended up with your numbers and conclusions.
Have fun,
Christian
P.S.:
> Erik, how is your data constructed from the pagecounts files, exactly? It's not made clear.
Oh, it is made clear. Search for “Archived input files” on that page that you linked :-) (And manually filter to the projects you care about and aggregate to the period of interest)
If you also challenge the way Erik arrives at those input files, you can look directly at the projectcounts files of pagecounts-raw or pagecounts-all-sites, and aggregate from them yourself.
The bash pipelines in the footnotes below are a rough, "order of magnitude" shot at that.
[1] For example, running on stat1002:

Wikibooks for February 2015:

_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:34:55 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*.b ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
43

Wiktionary for February 2015:

_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:35:51 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*.d ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
244

[...]

Commons for February 2015:

_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:36:26 // exit code: 0
cwd: ~
echo $(( ( $(grep '^commons.m ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
329

[2] Note that the reports suffer from the usual pagecounts-raw confusion around “.mw” being across projects. But that is just a column header being wrong. The numbers themselves are fine:

_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:40:22 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*.mw ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | grep -v '(commons|meta|incubator|species|strategy|outreach|usability|quality)' | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
6972
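
For anyone who wants to repeat this for other projects, here is a minimal sketch of how the pipelines in [1] and [2] could be wrapped into one function. The name normalized_millions and its arguments are made up for illustration and are not part of any existing tooling. It assumes each projectcounts line is space-separated, with the project code in field 1 and the request count in field 3 (which is what the "cut -f 3" above relies on). The factor 30 / days mirrors the "* 30 / 28" normalization to a 30-day month.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the request counts (field 3) of all
# projectcounts lines whose project code (field 1) matches a regex,
# normalize to a 30-day month, and print the result in millions.
#   $1     extended regex, matched against field 1 only
#   $2     number of days covered by the input files (e.g. 28)
#   $3...  projectcounts files
normalized_millions() {
    local pattern=$1 days=$2
    shift 2
    # Pass the values via the environment so awk does not
    # reinterpret backslashes in the regex.
    PATTERN=$pattern DAYS=$days awk '
        $1 ~ ENVIRON["PATTERN"] { total += $3 }
        END { printf "%d\n", int(total * 30 / ENVIRON["DAYS"] / 1000000) }
    ' "$@"
}
```

Called as normalized_millions '^[a-zA-Z_-]*[.]b$' 28 /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* it should give a figure comparable to the Wikibooks one in [1]. Unlike the grep patterns above, the dot is matched literally and the pattern is anchored at the end of the project code instead of at a trailing space.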