Unfortunately, the only team-members working full-time yesterday and today
are we Europe folks.
We weren't there when that happened and we don't get those alerts on the
phone, we should though.
Given that this system is tier-2, I do not think we need an immediate
response. 24 hours should be an acceptable ETA; I would even say 48.
On Fri, Nov 27, 2015 at 2:31 AM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
wrote:
Thanks, Ori, for having a look at this and restarting EL.
I understand it was 01:30 UTC on Friday (today), not Thursday. It went on
for 5-6 hours.
Unfortunately, the only team-members working full-time yesterday and today
are we Europe folks.
We weren't there when that happened and we don't get those alerts on the
phone, we should though.
This problem happened already like a month ago. We'll backfill the missing
events and will investigate.
Thanks again for the heads-up.
On Fri, Nov 27, 2015 at 8:01 AM, Ori Livneh <ori(a)wikimedia.org> wrote:
On Thu, Nov 26, 2015 at 10:46 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
Seems that eventlog1001 has not received any events since 01:30 UTC on
Thursday
http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Misc…
This is pretty severe; I'd page if it wasn't a US holiday.
Kafka clients on eventlog1001 were in an "Autocommitting consumer offset"
death-loop and not receiving any events from the Kafka brokers. I ran
eventloggingctl stop / eventloggingctl start and they recovered. Needs to
be investigated more thoroughly. Otto, can you follow up?
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation