Hi,
in the week from 2014-10-27 to 2014-11-02, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Hive UDF to parse user agents with ua_parser
* More kafkatee issues
* Database replication getting stuck on 'Duplicate entry'
* Ganglia's Views broke
* Fixing sync of “aggregate-datasets” rsync
* Turning down logstash logging
* 'research' database user
(details below)
Have fun, Christian
* Hive UDF to parse user agents with ua_parser
A Hive UDF that parses User-Agent strings with ua_parser was merged and deployed to the Analytics cluster. People with Hive access can now use it to automatically extract browser, OS, and device information.
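To give a rough idea of the fields ua_parser extracts (the UDF wraps the same library; the Python package below is used only for illustration, and the User-Agent string is made up):

    # Illustration of what ua_parser pulls out of a User-Agent string.
    # Requires the 'ua-parser' Python package; the Hive UDF wraps the same library.
    from ua_parser import user_agent_parser

    ua = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36')

    parsed = user_agent_parser.Parse(ua)
    print(parsed['user_agent']['family'])  # browser family, e.g. 'Chrome'
    print(parsed['os']['family'])          # OS family (exact value depends on the library version)
    print(parsed['device']['family'])      # device family, e.g. 'Other' for desktop browsers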
* More kafkatee issues
After the previous week's deployment of the new kafkatee build, we took a closer look at the generated files. While no partitions have been dropped so far, it turned out that kafkatee loses lines when other processes cause heavier disk activity. Even under such load, kafkatee's output files are still better than what udp2log can produce, but we're investigating whether they are good enough for users that need to stream data. (For non-streaming needs, Hive currently looks like the more reliable choice.)
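As a sketch of the kind of check that surfaces such lost lines: assuming the tab-separated output carries the sending host in the first column and a per-host sequence number in the second (the column positions are an assumption here, not the exact format), missing lines show up as gaps in the sequence numbers:

    # Rough sketch: count lines missing from an output file by looking for
    # gaps in per-host sequence numbers. Column positions are assumptions;
    # adjust them to the actual output format.
    import sys
    from collections import defaultdict

    last_seq = {}                  # hostname -> last sequence number seen
    missing = defaultdict(int)     # hostname -> number of missing lines

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        host, seq = fields[0], int(fields[1])
        if host in last_seq and seq > last_seq[host] + 1:
            missing[host] += seq - last_seq[host] - 1
        last_seq[host] = seq

    for host, count in sorted(missing.items()):
        print('%s\t%d' % (host, count))

It could be run for example as 'zcat some-output-file.gz | python check_gaps.py' (the file and script names are just examples).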
* Database replication getting stuck on 'Duplicate entry'
This week we had two more replication lag issues. Of October's five lag issues, the last three were caused by replication stopping on a 'Duplicate entry' error. Since this looks like an emerging pattern, we called it out with Ops; they are aware of it, but there is currently no fix for this issue.
* Ganglia's Views broke
Ganglia allows custom predefined dashboards (see Ganglia's “View” tab), which we use to watch kafka's and varnishkafka's key metrics. Some puppet refactoring seems to have broken the existing Ganglia dashboards. As we appear to be one of the few teams using Ganglia dashboards regularly, we fixed puppet's Ganglia View setup.
* Fixing sync of “aggregate-datasets” rsync
Some weeks back, work was started to have stat1002's “aggregate-datasets” directory automatically publish its content to the website at
. Now the final tweaks have been put into place, and automatic publishing works as expected.
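For illustration only, the general shape of such a publishing sync looks roughly like the following; the paths and the destination are hypothetical placeholders, not the actual setup:

    # Very rough sketch of a periodic publishing sync; SOURCE and DESTINATION
    # are hypothetical placeholders, not the real paths.
    import subprocess

    SOURCE = '/srv/aggregate-datasets/'            # hypothetical directory on stat1002
    DESTINATION = 'publisher@example.wikimedia.org:/srv/public-datasets/'  # hypothetical target

    subprocess.check_call([
        'rsync',
        '--archive',   # preserve timestamps, permissions, etc.
        '--delete',    # remove files from the target that were deleted at the source
        SOURCE,
        DESTINATION,
    ])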
* Turning down logstash logging
It turned out that the Analytics cluster, combined with other new log producers, generates more log traffic than the current logstash setup can handle nicely. So the log level for the Analytics cluster got turned down until logstash itself has been scaled up.
* 'research' database user
Many researchers and other WMFers use the 'research' credentials to access the analytics databases, and the time has come to switch those credentials to a new password. Since the password was not properly puppetized, discussions were started on how disruptive a change would be and how best to carry it out. Work on puppetizing the password has also started.