Hi Analytics dev team,
just a heads up that it has now been a week since the pagecounts-all-sites (and pagecounts-raw) datasets were left without their 20150409-160000 file [1].
To ease data quality assurance and avoid faulty aggregates, the pageview aggregator scripts that produce the data for dashiki's “Reader / Daily Pageviews” dashboard block for a week when data is missing (unless they are told that, for a given day, missing data is ok).
For the above hourly pagecounts-all-sites file, this week of blocking has now passed without action.
Hence, the aggregator scripts will start aggregating again (to some degree), but the undeclared hole in the data for 2015-04-09 will naturally start to bubble up.
If that hour's file cannot be generated, adding this date to the BAD_DATES.csv of the aggregator data repository will unblock the aggregator cron job and make the weekly and monthly aggregates treat 2015-04-09 as a day without data.
If that hour's file does get generated, be aware that the aggregator by default only backfills automatically for a week. So from today on, you need to run the script explicitly to backfill 2015-04-09.
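For the first case, as an illustration only (I have not re-checked the exact format of BAD_DATES.csv, so the one-date-per-line layout and the path below are assumptions), the change would look roughly like:

  # Rough sketch; check the existing entries in BAD_DATES.csv to match the real format.
  cd /path/to/aggregator/data         # assumed location of a checkout of the aggregator data repository
  echo "2015-04-09" >> BAD_DATES.csv  # assumes one YYYY-MM-DD date per line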
Have fun, Christian
P.S.: Since I guess the question of monitoring will arise ... the missing pagecounts file has already triggered email alerts at least twice, and the subsequent aggregator blocking has been logged. But if you want an additional notification, you can add yourself to the MAILTO of the aggregator cron at modules/statistics/manifests/aggregator.pp in puppet.
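To find the right spot, something like the following from a checkout of the puppet repository should do (sketch only; it simply assumes the MAILTO is set directly in that manifest, as described above):

  # Print the line(s) where the aggregator cron's MAILTO is set,
  # so you know where to add your address.
  grep -n "MAILTO" modules/statistics/manifests/aggregator.pp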
[1] http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-04/
    http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-04/
I've been trying to fix this data all week! Thought I had, but I hadn't checked the aggregator. Also, I never got emails about pagecounts-all-sites, but I have been checking things in HDFS. Will look into this more on Monday. Thanks Christian!
Hi Andrew,
On Fri, Apr 17, 2015 at 07:06:58PM -0400, Andrew Otto wrote:
> I've been trying to fix this data all week!
I am with you. Having had to do it a few times in the past, I definitely know the pain you're going through :-/
> Also, I never got emails about pagecounts-all-sites, [...]
Since the issue was earlier in the pipeline, the expected emails would not be about pagecounts-all-sites, but about a failed refining step (which blocks all downstream consumers of that partition [1]). The corresponding Oozie ID for the failed refining job is:
0058532-150220163729023-oozie-oozi-C@238
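In case it helps with the re-run, a sketch using the plain Oozie CLI (the server URL below is a placeholder you would need to adjust; the -C@238 suffix names coordinator action 238):

  export OOZIE_URL=http://oozie-server.example:11000/oozie            # placeholder; point at the cluster's Oozie server
  oozie job -info  0058532-150220163729023-oozie-oozi-C               # inspect the coordinator and its actions
  oozie job -rerun 0058532-150220163729023-oozie-oozi-C -action 238   # re-run just the failed action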
If you want specific alerts about pagecounts-all-sites,
https://gerrit.wikimedia.org/r/#/c/205067/
would be a simple way to achieve that.
> [...] but have been checking things in HDFS.
If jobs really failed or hung (as seems to have been the case here), I typically just abused the status script and grepped for a status X ... like:
  dump() {
    /srv/deployment/analytics/refinery/bin/refinery-dump-status-webrequest-partitions \
      --datasets legacy_tsvs,mediacounts,pagecounts_all_sites,pagecounts_raw,webrequest \
      $((15*24))
  }
  dump | head -n 4   # show the first four lines of the output for context
  dump | grep X      # show only the lines with status X, i.e. partitions still needing re-runs
That always gave me a nice list of where re-runs are still necessary.
(Of course, if jobs did not fail/hang but ran too early due to an overloaded cluster, the above command would not expose races like the one for 2015-04-15T15 on text.)
> Will look into this more on Monday.
You rock!
Have fun, Christian
[1] https://commons.wikimedia.org/wiki/File:Refinery-oozie-overview.png
I reran the offending jobs today. This data should now be available.
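If anyone wants to double-check from outside the cluster, something like this should work (the exact filename is guessed from the usual pagecounts-YYYYMMDD-HHMMSS.gz naming, so treat it as an assumption):

  # HEAD request; an HTTP 200 status line means the regenerated hourly file is published.
  curl -sI http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-04/pagecounts-20150409-160000.gz | head -n 1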
Hi,
just to keep the archives happy ... the issue above is now tracked as T97588:
https://phabricator.wikimedia.org/T97588
Have fun, Christian