<quote name="Greg Grossmeier" date="2014-06-08" time="20:07:41 -0700">
<quote name="Greg Grossmeier" date="2014-06-08" time="15:31:54 -0700"> > greg@rose:~/logs/irclogs/2014/Freenode$ grep "Varnishkafka Delivery Errors" \#wikimedia-operations.06-08.log | wc -l > 1354
Last three days (by Eastern US timezone):
greg@rose:~/logs/irclogs/2014/Freenode$ grep "Varnishkafka Delivery Errors" #wikimedia-operations.06-06.log | wc -l
0
greg@rose:~/logs/irclogs/2014/Freenode$ grep "Varnishkafka Delivery Errors" #wikimedia-operations.06-07.log | wc -l
80
greg@rose:~/logs/irclogs/2014/Freenode$ grep "Varnishkafka Delivery Errors" #wikimedia-operations.06-08.log | wc -l
1833
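(If it helps to pull the same count for a run of days, a quick loop like the sketch below does it; the directory and filename pattern are the ones shown above, the loop itself is only illustrative.)

for day in 06-06 06-07 06-08; do
    # grep -c counts matching lines, same as the grep | wc -l above
    printf '%s: ' "$day"
    grep -c "Varnishkafka Delivery Errors" ~/logs/irclogs/2014/Freenode/"#wikimedia-operations.$day.log"
done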
Do we care? If not, holy cow does this need to be silenced.
(Also, get this in Logstash :) )
From springle:
http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg%5B...
Hi Greg --
Sometimes we have connectivity issues with Amsterdam that may cause these errors. We do care, but they are intermittent and have been tricky to debug. Andrew will take a look tomorrow.
-Toby
Hi!
A bit more info:
Being the earliest Opsen around, I poked around on analytics1021 and 1022 (the brokers) and found a disk failure on /dev/sdf on analytics1021, along with a corresponding Java call stack in the broker log from when the broker died after the filesystem remounted read-only.
I unmounted the disk and found that more than a simple fsck is required. I therefore disabled puppet to avoid the endless broker service restart loop, and to avoid filling up /.
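For the record, roughly the steps involved (a sketch, not an exact transcript; whether the filesystem sits on the whole device or a partition isn't shown here):

dmesg | grep -i sdf        # confirm the I/O errors that forced the read-only remount
umount /dev/sdf            # take the failed disk's filesystem out of service
fsck -n /dev/sdf           # read-only check; this is what showed a simple fsck won't cut it
puppet agent --disable     # stop puppet from endlessly restarting the broker service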
Faidon silenced the Icinga noise with a patch.
The problems are at least twofold:
1. With only 1 of 2 brokers alive, there evidently isn't quite enough capacity. Jgage mentioned on IRC that additional capacity is planned.
2. Ori observed (sketched below): < ori> presumably the alert is flapping because the script manages to poll twice between flushes, in which case drerr has not gone up
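To make that concrete, here is a minimal sketch of a delta-style check; the stats path, state file, and counter field name are assumptions for illustration, not the real check. If the script polls twice between varnishkafka stats flushes, the counter delta is zero, the check reports OK, and the alert flaps.

#!/bin/bash
# Hypothetical delta-based delivery-error check (illustration only).
STATS=/var/cache/varnishkafka/varnishkafka.stats.json   # assumed stats location
STATE=/var/tmp/varnishkafka_drerr.last                   # assumed state file

curr=$(grep -o '"drerr":[0-9]*' "$STATS" | tail -1 | cut -d: -f2)
curr=${curr:-0}
prev=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$curr" > "$STATE"

if [ "$curr" -gt "$prev" ]; then
    echo "CRITICAL: drerr rose by $((curr - prev)) since last poll"
    exit 2
fi
# If we polled twice between stats flushes, curr == prev even though
# errors are still happening, so the check clears -- hence the flapping.
echo "OK: drerr unchanged since last poll"
exit 0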
Sean
Thanks Sean!