I've been trying to fix this data all week! I thought I had, but I hadn't checked the aggregator. Also, I never got the emails about pagecounts-all-sites, but I have been checking things in HDFS. I will look into this more on Monday. Thanks, Christian!
On Apr 17, 2015, at 18:47, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi Analytics dev team,
just a heads-up: it has now been a week since the 20150409-160000 file failed to get generated for pagecounts-all-sites (and pagecounts-raw) [1].
To ease data quality assurance and avoid faulty aggregates, the pageview aggregator scripts that feed dashiki's “Reader / Daily Pageviews” block for a week when data is missing (unless they are told that missing data is OK for a given day).
For the above hourly pagecounts-all-sites file, this week of blocking has now passed without action.
Hence, the aggregator scripts will start aggregating again (to some degree), but the undeclared hole in the data for 2015-04-09 will naturally start to bubble up.
If that hour's file cannot be generated, adding this date to the BAD_DATES.csv of the aggregator data repository will unblock the aggregator cron job and make the weekly and monthly aggregates treat 2015-04-09 as a day without data.
If that hour's file does get generated, be aware that by default the aggregator only backfills automatically for a week. So from today on, you need to run the script explicitly to backfill 2015-04-09.
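To make the blocking rule above concrete, here is a rough sketch in Python. It is not the aggregator's actual code: the function names, the one-ISO-date-per-row layout assumed for BAD_DATES.csv, and the exact handling of the seven-day boundary are all assumptions made for illustration.

```python
import csv
import io
from datetime import date, timedelta

# Assumption: "a week" of blocking is modelled as seven days.
BLOCK_WINDOW = timedelta(days=7)


def parse_bad_dates(csv_text):
    """Parse declared bad dates from CSV text.

    Assumption: the first column of each row holds an ISO date
    (YYYY-MM-DD) and there is no header row. The real BAD_DATES.csv
    layout may differ.
    """
    bad = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0].strip():
            bad.add(date.fromisoformat(row[0].strip()))
    return bad


def aggregation_decision(missing_day, today, bad_dates):
    """Decide what to do about a day with missing hourly data.

    Returns one of:
      "skip-day"            - the day is declared bad, aggregate without it
      "block"               - still inside the blocking window, wait for data
      "aggregate-with-hole" - window has passed, aggregate despite the gap
    """
    if missing_day in bad_dates:
        return "skip-day"
    if today - missing_day < BLOCK_WINDOW:
        return "block"
    return "aggregate-with-hole"


if __name__ == "__main__":
    missing = date(2015, 4, 9)
    print(aggregation_decision(missing, date(2015, 4, 15), set()))
    print(aggregation_decision(missing, date(2015, 4, 17), set()))
    print(aggregation_decision(missing, date(2015, 4, 15),
                               parse_bad_dates("2015-04-09\n")))
```

In this sketch, declaring 2015-04-09 as a bad date takes precedence over the blocking window, which mirrors the "unless they are told that missing data is OK" exception described above.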
Have fun, Christian
P.S.: Since I guess the question of monitoring will arise ... the missing pagecounts file has alerted people at least twice by email. The subsequent aggregator blocking has been logged. But you can add yourself in the MAILTO of the aggregator cron at modules/statistics/manifests/aggregator.pp in puppet, if you want an additional notification for that.
[1] http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-04/ http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-04/
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics