Hi Oliver,
On Thu, Mar 12, 2015 at 07:44:14PM -0400, Oliver Keyes wrote:
On 12 March 2015 at 19:41, Erik Zachte ezachte@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers.
So I just uploaded https://commons.wikimedia.org/wiki/File:PageViewsWikipedia2015.png which shows daily page views as collected by webstatscollector since 2008 and published in hourly projectcounts files in https://dumps.wikimedia.org/other/pagecounts-raw/ and aggregated by Wikistats per project (by week, month, day of week) and published in e.g. http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm (Wikipedia only, but webstatscollector doesn't report on any huge PV increase for other projects)
My initial comment in this thread (again) is that you defined a 'legacy' definition yourself, and built a script to implement your legacy definition.
Actually, no; the UDF is a replica of the Hive implementation of your definition, which Christian wrote.
I am with Erik when he disputes it being “his” definition.
It is webstatscollector's definition, which originates (as far as git logs tell) from Domas in 2008 [1], and has seen some updates since from other people like Hampton and Diederik. I think all of them did great work.
Almost 7 years after its implementation, it is still the yardstick at WMF to measure page views by. That's a great achievement. Kudos!
Erik's wonderful reports /use/ data that is based on those definitions. And Christian only ported the webstatscollector C implementation to Hive.
---------------------
Despite the efforts to update the webstatscollector pageview definition, I heard that technical limitations got in the way back then. Effectively, MediaWiki, the WMF-hosted wikis, and the shape of the corresponding request stream changed more often and more heavily than the webstatscollector definition was updated. Hence, now that those technical limitations are gone, there is a need to overhaul the pageview definition.
From my point of view, the numbers computed by the webstatscollector pageview definition and those computed by the overhauled pageview definition need not agree.
But with the webstatscollector pageview definition being the yardstick ... having an understanding within the organization where/why/how those numbers differ would not hurt.
YMMV.
Unfortunately I've been moved from R&D, and don't have the time to answer endless "just one more thing..." questions.
I have to admit that if you're not interested in doing QA, then the thread's subject of “final pageviews QA” misled me. I adjusted accordingly.
Have fun, Christian
[1] https://git.wikimedia.org/commit/analytics%2Fwebstatscollector.git/7617da88b...