Hi,
since the work that happens around the Analytics Cluster and on the Ops side of Analytics is not too visible, it was suggested to improve visibility by having a weekly write-up. We are posting it to the public list for a start, but if this is too much noise for you, please let us know.
In the week from 2014-08-18 to 2014-08-24, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Hadoop worker memory limits now automatically configured
* Automatic data removal was prepared and activated for webrequest data
* Adjusting access to raw webrequest data
* Learning from data ingestion alarms
* Webstatscollector and kafka
* Distupgrade on stat1003
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
(details below)
Have fun,
Christian
* Hadoop worker memory limits now automatically configured
Previously, each worker had the same memory limit, regardless of how many resources the worker actually had. By now allowing different memory limits on different workers, we can better utilize each worker's resources.
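To give an idea of what "automatically configured" means, here is a minimal sketch (not our actual puppet code) that derives a per-worker YARN memory limit from the machine's total RAM. The amount reserved for the OS and the rendered snippet are just placeholder assumptions:

  # Sketch: compute a per-worker YARN memory limit from total RAM instead of
  # using one hard-coded value for every worker.

  def yarn_memory_mb(total_ram_mb, reserved_for_os_mb=8192):
      """Give YARN everything except a fixed reservation for OS and daemons."""
      return max(total_ram_mb - reserved_for_os_mb, 1024)

  def render_yarn_site_snippet(total_ram_mb):
      limit = yarn_memory_mb(total_ram_mb)
      return (
          "<property>\n"
          "  <name>yarn.nodemanager.resource.memory-mb</name>\n"
          "  <value>%d</value>\n"
          "</property>" % limit
      )

  if __name__ == "__main__":
      # A worker with 64 GB RAM gets a larger limit than one with 24 GB.
      for ram in (24 * 1024, 64 * 1024):
          print(render_yarn_site_snippet(ram))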
* Automatic data removal was prepared and activated for webrequest data
Kraken's setup to remove raw webrequest data after a given number of days (currently: 31) was brought over to refinery and turned on.
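For illustration, a rough sketch of the retention idea; the real job is the refinery script, and the base path and directory layout below are simplified assumptions:

  # Sketch: drop hourly raw-webrequest partitions older than the retention
  # window (currently 31 days). Paths and layout are placeholders.
  import datetime
  import subprocess

  RETENTION_DAYS = 31
  BASE = "/wmf/data/raw/webrequest"  # assumed base path

  def old_partition_paths(now=None, days_back=40):
      """Yield partition directories that fall outside the retention window."""
      now = now or datetime.datetime.utcnow()
      cutoff = now - datetime.timedelta(days=RETENTION_DAYS)
      for delta in range(RETENTION_DAYS, days_back):
          day = now - datetime.timedelta(days=delta)
          if day < cutoff:
              yield "%s/year=%d/month=%02d/day=%02d" % (
                  BASE, day.year, day.month, day.day)

  def drop(path, dry_run=True):
      cmd = ["hdfs", "dfs", "-rm", "-r", "-skipTrash", path]
      if dry_run:
          print("would run:", " ".join(cmd))
      else:
          subprocess.check_call(cmd)

  if __name__ == "__main__":
      for p in old_partition_paths():
          drop(p)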
* Adjusting access to raw webrequest data
In order to have proper privilege separation on the cluster, access to the relevant paths has been split across separate groups.
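Roughly, the idea looks like the following sketch; group names, paths, and modes are made-up placeholders, not our actual setup:

  # Sketch: restrict raw webrequest data to a dedicated group, while refined
  # data stays readable by a broader group.
  import subprocess

  ACCESS_RULES = [
      # (path, group, mode) -- placeholder values
      ("/wmf/data/raw/webrequest", "hdfs-raw-readers", "750"),
      ("/wmf/data/refined",        "analytics-users",  "755"),
  ]

  def apply_rule(path, group, mode, dry_run=True):
      cmds = [
          ["hdfs", "dfs", "-chgrp", "-R", group, path],
          ["hdfs", "dfs", "-chmod", "-R", mode, path],
      ]
      for cmd in cmds:
          if dry_run:
              print("would run:", " ".join(cmd))
          else:
              subprocess.check_call(cmd)

  if __name__ == "__main__":
      for rule in ACCESS_RULES:
          apply_rule(*rule)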
* Learning from data ingestion alarms
With the new monitoring in place, we started to look at the alarms and are trying to make sense of them. Monitoring seems to work fine, and the partitions that got flagged really did have issues. On the flip side, the samples we checked that passed monitoring look valid too. So monitoring seems effective in both directions.
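For those curious what the monitoring checks, here is a toy sketch of the idea: each frontend host tags its log lines with an increasing sequence number, so gaps point at lost lines and repeats at duplicates. The real checks run over the raw partitions on the cluster; this is only an illustration:

  # Toy sketch: per-host sequence numbers expose missing and duplicated lines.
  from collections import defaultdict

  def check_partition(records):
      """records: iterable of (hostname, sequence_number) pairs."""
      seqs = defaultdict(list)
      for host, seq in records:
          seqs[host].append(seq)
      report = {}
      for host, nums in seqs.items():
          nums.sort()
          expected = nums[-1] - nums[0] + 1
          missing = expected - len(set(nums))
          duplicates = len(nums) - len(set(nums))
          report[host] = {"missing": missing, "duplicates": duplicates}
      return report

  if __name__ == "__main__":
      sample = [("cp1001", 1), ("cp1001", 2), ("cp1001", 4),     # one gap
                ("cp1002", 10), ("cp1002", 10), ("cp1002", 11)]  # one duplicate
      print(check_partition(sample))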
Most of the flagged partitions are due to races on varnish (Bug 69615). No log lines get lost or duplicated in such races.
There was one incident where a leader re-election caused a drop of a few hundred log lines (Bug 69854). Leader re-election may currently cause such hiccups, but there is already a theory about the real root cause of such drops, and it should be fixable.
The only other issue was one hour this Saturday (Bug 69971). It seems it affected only esams, but all four sources there. A real investigation is still pending.
So the raw data that is flowing into the cluster is generally good, and we're starting to iron out the glitches exposed by the monitoring.
* Webstatscollector and kafka
We started to work on making webstatscollector consume from kafka. It's a bit more involved than we hoped (burstiness of kafka, buffer receive errors, other processes blocking I/O, ...), but the latest build and setup, which has been running since about midnight, has worked without issues so far. *Knocking on wood*
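Just to illustrate the consumption side: the actual webstatscollector is not Python, so the sketch below only shows the idea, assuming the kafka-python KafkaConsumer interface and placeholder topic and broker names:

  # Sketch: read webrequest log lines from a kafka topic and hand them to the
  # aggregation step. Topic and broker names are placeholders.
  from kafka import KafkaConsumer

  def handle_log_line(line):
      # Placeholder for the webstatscollector aggregation; here we just print.
      print(line)

  def consume(topic="webrequest_text", servers=("kafka1001:9092",)):
      consumer = KafkaConsumer(topic, bootstrap_servers=list(servers))
      for message in consumer:
          line = message.value.decode("utf-8", errors="replace")
          handle_log_line(line)

  if __name__ == "__main__":
      consume()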
* Distupgrade on stat1003
stat1003 had its distribution upgraded. Shiny new software for researchers :-)
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
Packet loss was limited to two periods of a few minutes each. The root cause of the issues was Bug 69661, which backfired.
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
A database connection got dropped, which made the aggregation fail on 2014-08-19. The root cause of the connection drop is unknown. Nothing noteworthy happened on the database server that was used, nor on stat1003 (the distupgrade coincidentally took place on the same day, but later in the day). Since this is the first time it happened, we're writing it off as a fluke for now.