Hi,
in the week from 2014-10-27 to 2014-11-02, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Hive UDF to parse user agents with ua_parser
* More kafkatee issues
* Database replication getting stuck on 'Duplicate entry'
* Ganglia's Views broke
* Fixing sync of “aggregate-datasets” rsync
* Turning down logstash logging
* 'research' database user
(details below)
Have fun, Christian
* Hive UDF to parse user agents with ua_parser
A Hive UDF that parses User-Agent strings with ua_parser was merged and deployed to the Analytics cluster. People with Hive access can now use it to automatically extract browser, OS, and device information.
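To give a rough idea of the fields ua_parser extracts (the UDF wraps the same library; the Python package below is used only for illustration, and the User-Agent string is made up):

    # Illustration of what ua_parser pulls out of a User-Agent string.
    # Requires the 'ua-parser' Python package; the Hive UDF wraps the same library.
    from ua_parser import user_agent_parser

    ua = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36')

    parsed = user_agent_parser.Parse(ua)
    print(parsed['user_agent']['family'])  # browser family, e.g. 'Chrome'
    print(parsed['os']['family'])          # OS family (exact value depends on the library version)
    print(parsed['device']['family'])      # device family, e.g. 'Other' for desktop browsers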
* More kafkatee issues
After the previous week's deployment of the new kafkatee build, we took a closer look at the generated files. While no partitions have been dropped so far, it turned out that kafkatee loses lines when other processes cause heavier disk activity. Even under such load, kafkatee's output files are still better than what udp2log can produce, but we're investigating whether they are good enough for users that need to stream data. (For non-streaming needs, Hive currently looks like the more reliable choice.)
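As a sketch of the kind of check that surfaces such lost lines: assuming the tab-separated output carries the sending host in the first column and a per-host sequence number in the second (the column positions are an assumption here, not the exact format), missing lines show up as gaps in the sequence numbers:

    # Rough sketch: count lines missing from an output file by looking for
    # gaps in per-host sequence numbers. Column positions are assumptions;
    # adjust them to the actual output format.
    import sys
    from collections import defaultdict

    last_seq = {}                  # hostname -> last sequence number seen
    missing = defaultdict(int)     # hostname -> number of missing lines

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        host, seq = fields[0], int(fields[1])
        if host in last_seq and seq > last_seq[host] + 1:
            missing[host] += seq - last_seq[host] - 1
        last_seq[host] = seq

    for host, count in sorted(missing.items()):
        print('%s\t%d' % (host, count))

It could be run for example as 'zcat some-output-file.gz | python check_gaps.py' (the file and script names are just examples).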
* Database replication getting stuck on 'Duplicate entry'
This week we had two more replication lag issues. Of October's five lag issues, the last three were caused by replication stopping on a 'Duplicate entry' error. Since this looks like an emerging pattern, we called it out with Ops; they are aware of it, but there is currently no fix for this issue.
* Ganglia's Views broke
Ganglia allows custom predefined dashboards (see Ganglia's “View” tab), which we use to watch kafka's and varnishkafka's key metrics. Some puppet refactoring seems to have broken the existing Ganglia dashboards. As we appear to be one of the few teams using Ganglia dashboards regularly, we fixed puppet's Ganglia View setup.
* Fixing sync of “aggregate-datasets” rsync
Some weeks back, work was started to have stat1002's “aggregate-datasets” directory automatically publish its content to the website at
. Now the final tweaks have been put into place, and automatic publishing works as expected.
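For illustration only, the general shape of such a publishing sync looks roughly like the following; the paths and the destination are hypothetical placeholders, not the actual setup:

    # Very rough sketch of a periodic publishing sync; SOURCE and DESTINATION
    # are hypothetical placeholders, not the real paths.
    import subprocess

    SOURCE = '/srv/aggregate-datasets/'            # hypothetical directory on stat1002
    DESTINATION = 'publisher@example.wikimedia.org:/srv/public-datasets/'  # hypothetical target

    subprocess.check_call([
        'rsync',
        '--archive',   # preserve timestamps, permissions, etc.
        '--delete',    # remove files from the target that were deleted at the source
        SOURCE,
        DESTINATION,
    ])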
* Turning down logstash logging
It turned out that the Analytics cluster, combined with other new log producers, generates more log traffic than the current logstash setup can handle nicely. So the log level for the Analytics cluster got turned down until logstash itself has been scaled up.
* 'research' database user
Many researchers and other WMFers use the 'research' credentials to access the analytics databases, and the time has come to switch those credentials to a new password. Since the password was not properly puppetized, discussions were started on how disruptive a change would be and how best to carry it out. Work on puppetizing the password has also started.