Hi,
Since the work that happens around the Analytics Cluster and on the
Ops side of Analytics is not very visible, it was suggested that we
improve visibility with a weekly write-up.
Posting it to the public list for a start, but if this is too much noise
for you, please let us know.
In the week of 2014-08-18 through 2014-08-24, Andrew, Jeff, and I
worked on the following items around the Analytics Cluster and
Analytics related Ops:
* Hadoop worker memory limits now automatically configured
* Automatic data removal was prepared and activated for webrequest data
* Adjusting access to raw webrequest data
* Learning from data ingestion alarms
* Webstatscollector and kafka
* Distupgrade on stat1003
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
(details below)
Have fun,
Christian
* Hadoop worker memory limits now automatically configured
Previously, each worker had the same memory limit, regardless of the
resources the worker actually had. By allowing different memory
limits on different workers, we can now better utilize the resources
of each worker.
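The idea can be sketched roughly as follows (the formula, reserve
value, and function name are hypothetical illustrations, not the
actual configuration logic):

```python
def worker_memory_limit_mb(node_ram_mb, os_reserve_mb=4096):
    # Derive a worker's Hadoop memory limit from the RAM the node
    # actually has, instead of one cluster-wide constant.
    # Hypothetical formula for illustration only.
    return max(node_ram_mb - os_reserve_mb, 1024)

big_worker = worker_memory_limit_mb(65536)    # 64 GB node
small_worker = worker_memory_limit_mb(24576)  # 24 GB node
```

With per-node limits like this, a 64 GB worker is no longer capped at
whatever a 24 GB worker can handle.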
* Automatic data removal was prepared and activated for webrequest data
Kraken's setup to remove raw webrequest data after a given number of
days (currently: 31) was brought over to refinery and turned on.
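The retention logic amounts to something like this sketch (dates
only; the real job operates on HDFS paths and Hive partitions, and
the function name is made up):

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # raw webrequest data older than this is removed

def partitions_to_drop(partition_dates, today):
    # Return the partition dates that fall outside the retention
    # window and should be deleted.
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(d for d in partition_dates if d < cutoff)

old = partitions_to_drop({date(2014, 7, 1), date(2014, 8, 20)},
                         today=date(2014, 8, 24))
```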
* Adjusting access to raw webrequest data
In order to have proper privilege separation on the cluster, access
paths have been split into different groups.
* Learning from data ingestion alarms
With the new monitoring in place, we started to look at the alarms
and are trying to make sense of them. Monitoring seems to work fine:
the partitions that got flagged really did have issues. On the flip
side, some samples that passed monitoring turned out to be valid upon
checking. So monitoring seems effective in both directions.
Most of the flagged partitions are due to races on varnish (Bug
69615). No log lines get lost or duplicated in such races.
There was one incident where a leader re-election caused a drop of
a few hundred log lines (Bug 69854). Leader re-election currently
may cause such hiccups, but there is already a theory about the real
root cause of such drops, and it should be fixable.
The only other issue was one hour this Saturday (Bug 69971), which
is still pending investigation. It seems to have affected only
esams, but all four sources there.
So the raw data that is flowing into the cluster is generally
good. And we're starting on ironing out the glitches exposed by the
monitoring.
* Webstatscollector and kafka
We started to work on making webstatscollector consume from
kafka. It's a bit more involved than we hoped (burstiness of kafka,
buffer receive errors, other processes blocking I/O, ...), but the
latest build and setup, which has been running since about midnight,
has worked without issues so far.
*Knocking on wood*
* Distupgrade on stat1003
stat1003 had its distribution upgraded. New shiny software for
researchers :-)
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
Packet loss was limited to two periods of a few minutes each. The
root cause of the issues was bug 69661, which backfired.
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
A database connection got dropped, which made the aggregation fail
on 2014-08-19. The root cause of the connection drop is unknown.
Nothing noteworthy happened on the database server used, nor on
stat1003 (the distupgrade coincidentally took place on the same
day, but later in the day). Since this happened for the first time,
we're writing it off as a fluke for now.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage:
http://quelltextlich.at/
---------------------------------------------------------------