Hi Oliver,
On Thu, Mar 12, 2015 at 07:00:07PM -0400, Oliver Keyes wrote:
> And now the UDF, the hive query, and the monthly aggregate of the
> hive query, all disagree with
> http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm .
> All of the aforementioned sources come up with 24bn, not 20.38.
(Assuming "hive query" means the Hive implementation of the
webstatscollector pageview definition.)
I doubt that you're comparing apples to apples.
Taking a quick (and known to not necessarily be fully exact) shot at
reproducing numbers, I can basically verify Erik's numbers [1][2] from
the above URL.
So in contrast to what you claim, Erik's reports and the
Hive-implementation of webstatscollector agree.
Since you said “monthly aggregates” but referenced a “normalized”
report of Erik's, I somewhat get the feeling you're comparing apples to
oranges.
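To make the difference concrete, here is a minimal sketch of the scaling that the "* 30 / 28" in the footnote pipelines performs, under the assumption that "normalized" means "scaled to a 30-day month". The function name is made up for illustration.

```shell
# Hypothetical helper: scale a raw monthly total to a 30-day month.
# $1 = raw total for the month, $2 = number of days in that month.
normalize_to_30_days() {
    # Integer arithmetic, same as in the footnote pipelines.
    echo $(( $1 * 30 / $2 ))
}

# Example call: a raw total of 28000 for a 28-day February.
normalize_to_30_days 28000 28
```

For a 28-day February the normalized figure is about 7% above the raw monthly aggregate, so the two kinds of numbers should not be expected to match.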
So (I know I am sounding like a broken record), please, Oliver: instead
of claiming things without giving us a chance to reproduce them, show how
you ended up with your numbers and conclusions.
Have fun,
Christian
P.S.:
> Erik,
> how is your data constructed from the pagecounts files, exactly? It's
> not made clear.
Oh, it is made clear.
Search for “Archived input files” on that page that you linked :-)
(And manually filter to the projects you care about and aggregate to
the period of interest)
If you also doubt the way Erik arrives at those input files, you
can look at the projectcounts files of pagecounts-raw or
pagecounts-all-sites and aggregate from them directly.
The bash pipelines in the footnotes below are a rough, "order of
magnitude" shot at that.
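Spelled out step by step, the aggregation looks roughly like the following sketch. It assumes (as the "cut -f 3" in the footnotes implies) that projectcounts lines are space-separated with the request count in the third field. The function name and its arguments are made up for illustration.

```shell
# Hypothetical helper, not part of any existing tooling.
# $1 = grep pattern selecting the project lines (e.g. '^[a-zA-Z_-]*\.b ')
# $2 = number of days in the month being aggregated
# remaining arguments = projectcounts files to read
normalized_millions() {
    pattern="$1"
    days="$2"
    shift 2
    # 1. Keep only the lines of the wanted projects.
    # 2. Sum the third field (the request count).
    # 3. Scale to a 30-day month and express the result in millions.
    cat -- "$@" \
        | grep -- "$pattern" \
        | awk -v days="$days" \
            '{ sum += $3 } END { printf "%d\n", sum * 30 / days / 1000000 }'
}

# Usage, with the same files as in footnote [1]:
# normalized_millions '^[a-zA-Z_-]*\.b ' 28 \
#     /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-*
```

Reading the files through cat keeps grep from prefixing each line with the file name, so the count stays in field 3 without relying on the path containing no spaces.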
[1] For example running on stat1002:
Wikibooks for February 2015:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:34:55 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*\.b ' \
    /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* \
    | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
43
Wiktionary for February 2015:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:35:51 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*\.d ' \
    /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* \
    | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
244
[...]
Commons for February 2015:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:36:26 // exit code: 0
cwd: ~
echo $(( ( $(grep '^commons\.m ' \
    /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* \
    | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
329
[2] Note that the reports suffer the usual pagecounts-raw confusion
around “.mw” being across projects. But that is just a column header
being wrong. The numbers themselves are fine:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:40:22 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*\.mw ' \
    /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* \
    | grep -v '\(commons\|meta\|incubator\|species\|strategy\|outreach\|usability\|quality\)' \
    | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
6972
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage:
http://quelltextlich.at/
---------------------------------------------------------------