Hi,
Since the work that happens around the Analytics Cluster and on the
Ops side of Analytics is not very visible, it was suggested that we
improve visibility with a weekly write-up.
Posting it to the public list for a start, but if this is too much noise
for you, please let us know.
In the week of 2014-08-18 through 2014-08-24, Andrew, Jeff, and I
worked on the following items around the Analytics Cluster and
Analytics related Ops:
* Hadoop worker memory limits now automatically configured
* Automatic data removal was prepared and activated for webrequest data
* Adjusting access to raw webrequest data
* Learning from data ingestion alarms
* Webstatscollector and kafka
* Distupgrade on stat1003
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
(details below)
Have fun,
Christian
* Hadoop worker memory limits now automatically configured
Previously, each worker had the same memory limit, regardless of the
resources the worker actually had. By allowing different memory
limits on different workers, we can now better utilize the resources
of each worker.
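The idea can be sketched roughly as follows (the formula, reserve
value, and function name are hypothetical illustrations, not the
actual configuration logic):

```python
def worker_memory_limit_mb(node_ram_mb, os_reserve_mb=4096):
    # Derive a worker's Hadoop memory limit from the RAM the node
    # actually has, instead of one cluster-wide constant.
    # Hypothetical formula for illustration only.
    return max(node_ram_mb - os_reserve_mb, 1024)

big_worker = worker_memory_limit_mb(65536)    # 64 GB node
small_worker = worker_memory_limit_mb(24576)  # 24 GB node
```

With per-node limits like this, a 64 GB worker is no longer capped at
whatever a 24 GB worker can handle.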
* Automatic data removal was prepared and activated for webrequest data
Kraken's setup to remove raw webrequest data after a given number of
days (currently: 31) was brought over to refinery and turned on.
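The retention logic amounts to something like this sketch (dates
only; the real job operates on HDFS paths and Hive partitions, and
the function name is made up):

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # raw webrequest data older than this is removed

def partitions_to_drop(partition_dates, today):
    # Return the partition dates that fall outside the retention
    # window and should be deleted.
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(d for d in partition_dates if d < cutoff)

old = partitions_to_drop({date(2014, 7, 1), date(2014, 8, 20)},
                         today=date(2014, 8, 24))
```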
* Adjusting access to raw webrequest data
In order to have proper privilege separation on the cluster, access
paths have been split into different groups.
* Learning from data ingestion alarms
With the new monitoring in place, we started to look at the alarms
and are trying to make sense of them. Monitoring seems to work fine:
the partitions that got flagged really did have issues. On the flip
side, some samples that passed monitoring turned out to be valid upon
checking. So monitoring seems effective in both directions.
Most of the flagged partitions are due to races on varnish (Bug
69615). No log lines get lost or duplicated in such races.
There was one incident where a leader re-election caused a drop of
a few hundred log lines (Bug 69854). Leader re-election currently
may cause such hiccups, but there is already a theory about the real
root cause of such drops, and it should be fixable.
The only other issue was one hour this Saturday (Bug 69971), which
is still pending investigation. It seems to have affected only
esams, but all four sources there.
So the raw data that is flowing into the cluster is generally
good. And we're starting on ironing out the glitches exposed by the
monitoring.
* Webstatscollector and kafka
We started to work on making webstatscollector consume from
kafka. It's a bit more involved than we hoped (burstiness of kafka,
buffer receive errors, other processes blocking I/O, ...), but the
latest build and setup, which has been running since about midnight,
has worked without issues so far.
*Knocking on wood*
* Distupgrade on stat1003
stat1003 had its distribution upgraded. New shiny software for
researchers :-)
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
Packet loss was limited to two periods of a few minutes each. The
root cause of the issues was bug 69661, which backfired.
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
A database connection got dropped, which made the aggregation fail
on 2014-08-19. The root cause of the connection drop is unknown.
Nothing noteworthy happened on the database server used, nor on
stat1003 (the distupgrade coincidentally took place on the same
day, but later in the day). Since this happened for the first time,
we're writing it off as a fluke for now.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage:
http://quelltextlich.at/
---------------------------------------------------------------