Hi,
in the week from 2014-08-25 to 2014-08-31, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Analytics cluster feeding more logs into logstash
* More buffer for kafka brokers
* Life support for webstatscollector on udp2log
* Webstatscollector and kafka
* Webstatscollector counting https requests from ulsfo twice
(details below)
Have fun,
Christian
* Analytics cluster feeding more logs into logstash
The analytics cluster previously only fed logs from the worker nodes into logstash, and now also feeds logs from namenodes into logstash.
* More buffer for kafka brokers
During partition leader re-elections, kafka brokers sometimes drop a few log lines. The kafka broker buffers were too small to cover the time a re-election might take, so we increased the buffer size. This could help brokers get through a partition leader re-election without dropping messages.
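The sizing reasoning above can be sketched as simple arithmetic: a buffer only helps if it can hold everything that arrives while the re-election is in progress. This is a minimal sketch, not the actual kafka configuration; the function names and all numbers are hypothetical and only illustrate the relationship between ingest rate, election duration, and buffer size.

```python
def min_buffer_bytes(ingest_bytes_per_sec, election_secs):
    """Smallest buffer that can absorb all data arriving during an election.

    Both arguments are illustrative assumptions, not measured values.
    """
    return ingest_bytes_per_sec * election_secs


def buffer_covers_election(buffer_bytes, ingest_bytes_per_sec, election_secs):
    """Return True if the buffer can hold the data arriving during the election."""
    return buffer_bytes >= min_buffer_bytes(ingest_bytes_per_sec, election_secs)


# Hypothetical example: 100 bytes/s arriving, election takes up to 20 s.
needed = min_buffer_bytes(100, 20)
small_ok = buffer_covers_election(1000, 100, 20)   # buffer smaller than needed
large_ok = buffer_covers_election(4000, 100, 20)   # buffer larger than needed
```

With these made-up numbers, the smaller buffer would overflow before the election finishes, while the larger one would not.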
* Life support for webstatscollector on udp2log
The production webstatscollector that consumes from udp2log (the software that produces the hourly pageview files used, for example, by stats.wikimedia.org and stats.grok.se) started to produce faulty files. Another service on the host that runs part of webstatscollector was no longer needed and was greedy with resources, so we stopped it to free up resources. Strangely enough, those additional resources made webstatscollector misbehave even more: the disks could no longer handle the load. After switching webstatscollector to write to a RAM disk, the host could handle the write load again. This switch not only allowed us to bring webstatscollector back to life, but also decreased packet loss on the collector by a bit more than an order of magnitude.
* Webstatscollector and kafka
Last week we reported that we spun up a webstatscollector instance that consumes from kafka instead of udp2log, and that the setup caused some issues at first. We have now monitored the “webstatscollector on kafka” setup for a week, and it produced the data extremely reliably. So with this webstatscollector on kafka, we have a good baseline to compare against when trying to scale up webstatscollector to Hadoop.
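A comparison against such a baseline could look roughly like the following. This is only a sketch under assumptions: it supposes that each run can be reduced to a mapping from some key (for example a project name) to a request count, and the function name and sample data are hypothetical, not taken from webstatscollector.

```python
def relative_differences(baseline, candidate):
    """Per-key relative difference of candidate counts versus baseline counts.

    Keys missing on one side are treated as a count of 0. A key that is
    absent from the baseline but present in the candidate maps to infinity.
    """
    diffs = {}
    for key in baseline.keys() | candidate.keys():
        b = baseline.get(key, 0)
        c = candidate.get(key, 0)
        if b == 0:
            diffs[key] = float("inf") if c else 0.0
        else:
            diffs[key] = (c - b) / b
    return diffs


# Hypothetical hourly counts from the baseline and from a new setup.
baseline_counts = {"en": 100, "de": 50}
candidate_counts = {"en": 110, "de": 50, "fr": 5}
diffs = relative_differences(baseline_counts, candidate_counts)
```

Keys with a large relative difference would be the first place to look when a new setup disagrees with the baseline.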
* Webstatscollector counting https requests from ulsfo twice
While working on establishing the “webstatscollector on kafka” baseline, we discovered that the udp2log webstatscollector counts https requests from ulsfo twice. The corresponding fix was merged the same day, but due to “no deploys on Fridays” the deploy did not happen last week. (It has been deployed since, and the numbers look good.)