Hi,
In the week from 2014-10-13 to 2014-10-19, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Webstatscollector deployment (Bug 66352, Bug 71790)
* Testing potential kafkatee fix
* Analytics1021, its partition leader role, and missing data
* gp.wmflabs.org showing empty graphs
* Database lags
* Obtaining HTTPS numbers to assist around POODLE vulnerability
* Redeployment of some Hive scripts
* Preparations for ua_parser Hive UDF
(details below)
Have fun,
Christian
* Webstatscollector deployment (Bug 66352, Bug 71790)
As reported in previous weeks, new webstatscollector builds have been prepared to stop counting requests to the “Undefined” page (Bug 66352), and to stop counting redirects twice (Bug 71790). These new builds have now been deployed to both webstatscollector pipelines.
* Testing potential kafkatee fix
From time to time, kafkatee did not consume from all relevant kafka partitions. The kafkatee maintainer provided a potential fix, which has been running on analytics1003 since. The kafkatee-generated files look good for now, but since the issue previously took some time to manifest, the tests need to run a bit longer.
* Analytics1021, its partition leader role, and missing data
Analytics1021 again dropped out of its partition leader role. This is the first time it happened after ACK parameters got tuned on some machines. The tuning proved to be worth it, as the caches with tuned ACK parameters did not see message loss.
Since the issue happened again later, and it was again exactly the machines with tuned ACK parameters that did not see message loss, we can prepare to roll out the tuned ACK parameters more widely.
* gp.wmflabs.org showing empty graphs
In 2013, some graphs of gp.wmflabs.org were taken offline due to privacy concerns. However, the main dashboard still referenced some of those graphs, and rendered them as empty graphs. This made the dashboard /look/ broken, although the public graphs were rendered as expected. We updated the dashboard to no longer reference offline graphs, so the dashboard does not look broken any longer.
* Database lags
Due to different, unrelated causes, some databases lagged considerably during this week. Ops got the databases back to normal again.
* Obtaining HTTPS numbers to assist around POODLE vulnerability
In order to decide how to address the POODLE vulnerability, Ops needed numbers on HTTPS usage by old browsers. Since this data is not prepared automatically, we extracted the numbers from the logs.
* Redeployment of some Hive scripts
It seems an unannounced Friday deployment during the SF hackathon angered the deployment gods, and caused some Oozie/Hive jobs to not run correctly. So we had to fix the setup, resubmit the jobs, and backfill the missing data. No data got lost.
* Preparations for ua_parser Hive UDF
There is a push from several sides to have a Hive UDF that can parse User-Agents. A good part of our time was spent implementing and reviewing this UDF. But it's not yet merged and will require a bit more work.