On Thu, Nov 26, 2015 at 10:46 PM, Ori Livneh ori@wikimedia.org wrote:
Seems that eventlog1001 has not received any events since 01:30 UTC on Thursday
http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Misce...
This is pretty severe; I'd page if it wasn't a US holiday.
Kafka clients on eventlog1001 were in an "Autocommitting consumer offset" death-loop and not receiving any events from the Kafka brokers. I ran eventloggingctl stop / eventloggingctl start and they recovered. Needs to be investigated more thoroughly. Otto, can you follow up?
Thanks, Ori, for having a look at this and restarting EL.
I understand it was 01:30 UTC on Friday (today), not Thursday. It went on for 5-6 hours. Unfortunately, the only team members working full-time yesterday and today are those of us in Europe. We weren't there when that happened and we don't get those alerts on our phones, though we should.
This problem already happened about a month ago. We'll backfill the missing events and investigate. Thanks again for the heads-up.
On Fri, Nov 27, 2015 at 8:01 AM, Ori Livneh ori@wikimedia.org wrote:
Kafka clients on eventlog1001 were in an "Autocommitting consumer offset" death-loop and not receiving any events from the Kafka brokers. I ran eventloggingctl stop / eventloggingctl start and they recovered. Needs to be investigated more thoroughly. Otto, can you follow up?
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Fri, Nov 27, 2015 at 2:31 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Unfortunately, the only team members working full-time yesterday and today are those of us in Europe. We weren't there when that happened and we don't get those alerts on our phones, though we should.
Given that this system is tier-2, I do not think we need an immediate response; 24 hours should be an acceptable ETA. I would say even 48.
Please take a look at the preliminary outage report (with pretty pictures!). TL;DR: Kafka had a small outage and eventlogging is not resilient enough to deal with those; the restart that Ori did brought eventlogging back up. We have measures in place to deal with SQL insertion after an event like this one but, at this time, we need to verify that the SQL insertion has caught up with its backlog.
https://wikitech.wikimedia.org/wiki/Incident_documentation/20151127-EventLog...
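As a rough illustration of the resilience gap described above, here is a minimal Python sketch of a watchdog that flags a consumer as stalled when no event has arrived for a while. `StallDetector`, its methods, and the 15-minute threshold are hypothetical names and values chosen for the example; none of them come from EventLogging or Kafka, and the timestamps simply reuse the 01:30 UTC figure from this thread.

```python
from datetime import datetime, timedelta, timezone


class StallDetector:
    """Hypothetical helper: remembers the last event time and reports long silences."""

    def __init__(self, max_silence: timedelta):
        self.max_silence = max_silence
        self.last_event_at = None  # no event seen yet

    def record_event(self, when: datetime) -> None:
        # Call this every time the consumer hands over an event.
        self.last_event_at = when

    def is_stalled(self, now: datetime) -> bool:
        if self.last_event_at is None:
            return False  # nothing to compare against yet
        return now - self.last_event_at > self.max_silence


# Assumed threshold of 15 minutes, purely for the example.
detector = StallDetector(max_silence=timedelta(minutes=15))
detector.record_event(datetime(2015, 11, 27, 1, 30, tzinfo=timezone.utc))

# Ten minutes of silence is still within the threshold; thirty minutes is not.
stalled_early = detector.is_stalled(datetime(2015, 11, 27, 1, 40, tzinfo=timezone.utc))
stalled_late = detector.is_stalled(datetime(2015, 11, 27, 2, 0, tzinfo=timezone.utc))
```

A supervisor could poll `is_stalled` and restart the consumers, much like the manual eventloggingctl stop / eventloggingctl start did here; that wiring is left out because it depends on deployment details not covered in this thread.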
It seems like it would depend on the class of error. 48 hours for events not syncing, fine. 48 hours of /total data loss/ is a completely different class of problem.
On 27 November 2015 at 11:35, Nuria Ruiz nuria@wikimedia.org wrote:
Given that this system is tier-2, I do not think we need an immediate response; 24 hours should be an acceptable ETA. I would say even 48.
Team, I checked and, indeed, the EventLogging database needs backfilling from 2015-11-27 01:00 until 2015-11-27 07:00. I updated the docs and started the backfilling process. I'll let you know when it is finished. Cheers
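For illustration, the backfill window above could be split into hourly chunks so that each one can be replayed and checked separately. This is only a sketch: `hourly_windows` is a hypothetical helper, not the procedure from the docs, and the hourly chunk size is an assumption.

```python
from datetime import datetime, timedelta


def hourly_windows(start: datetime, end: datetime):
    """Split [start, end) into consecutive windows of at most one hour."""
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(hours=1), end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


# Window taken from the message above: 2015-11-27 01:00 until 07:00.
windows = hourly_windows(datetime(2015, 11, 27, 1, 0), datetime(2015, 11, 27, 7, 0))
for lower, upper in windows:
    print(f"backfill {lower:%Y-%m-%d %H:%M} -> {upper:%H:%M}")
```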
On Fri, Nov 27, 2015 at 8:31 PM, Oliver Keyes okeyes@wikimedia.org wrote:
It seems like it would depend on the class of error. 48 hours for events not syncing, fine. 48 hours of /total data loss/ is a completely different class of problem.
-- Oliver Keyes Count Logula Wikimedia Foundation