EventLogging postmortem, and maintenance responsibilities

Ori Livneh

20 Mar 2014

11:52 a.m.

At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming connections. At some point during the subsequent hour, MariaDB had either crashed or been manually restarted. Sean noticed that the database was choking on some queries from the researchers and notified the wmfresearch list.

During the time that the database server was down or rejecting connections, the EventLogging writer that writes to db1047 repeatedly failed to connect to it:

sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on 'db1047.eqiad.wmnet' (111)")

The Upstart job for EventLogging is configured to re-spawn the writer, up to a certain threshold of failures. Because the writer repeatedly failed to connect, it hit the threshold, and was not re-spawned.
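As a rough illustration only (this is not how the EventLogging writer or its Upstart job is implemented), one way to keep a consumer from exhausting a respawn limit during a database outage is to retry the connection inside the process, waiting longer after each failure. Every name below is an assumption made for the sketch, and `ConnectionError` merely stands in for the SQLAlchemy `OperationalError` shown above.

```python
import time


def connect_with_backoff(connect, max_attempts=8, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    `connect` is a hypothetical zero-argument callable that returns a
    connection or raises ConnectionError. The last failure is re-raised
    once max_attempts is reached, so a supervisor can still notice it.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            # Exponential backoff, capped so the waits stay bounded.
            delay = min(delay * 2, max_delay)
```

With the default values the process would keep trying for roughly two minutes before giving up, which trades a quick crash for a slower, quieter failure; whether that is desirable depends on how the alerting is set up.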

This triggered an Icinga alert:

[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047

This alert was not responded to. I finally got pinged by Tillman, who noticed the blog visitor stats report was blank, and by Gilles, who noticed image loading performance data was missing.

We have to fix this. The level of maintenance that EventLogging gets is not proportional to its usage across the organization. Analytics, I really need you to step up your involvement.

It was not long ago that EventLogging was running reliably for months at a time. What has changed is not system load, but the owner seat becoming vacant, leading to a gradual deterioration of the quality of monitoring and auditing practices.

Sean proposed moving the EventLogging database to m2, so that it runs on separate hardware from the research databases. I think he's right. I filed <https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the migration.

There is some code rot around the Ganglia and Graphite monitoring code for EventLogging. I don't think it would take much to fix. Could the Analytics team take this on?

The Puppet code is well-documented. <https://wikitech.wikimedia.org/wiki/EventLogging> could use some updating, but it is mostly current.

Finally, I think EventLogging Icinga alerts should have a higher profile, and possibly page someone. Issues can usually be debugged using the eventloggingctl tool on Vanadium and by inspecting the log files on vanadium:/var/log/upstart/eventlogging-*.

--- Ori Livneh ori@wikimedia.org

Dan Andreescu

20 Mar

12:50 p.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

Thank you for the detailed write-up Ori

> We have to fix this. The level of maintenance that EventLogging gets is not proportional to its usage across the organization. Analytics, I really need you to step up your involvement.
>
> It was not long ago that EventLogging was running reliably for months at a time. What has changed is not system load, but the owner seat becoming vacant, leading to a gradual deterioration of the quality of monitoring and auditing practices.

Indeed, the owner seat is vacant. According to a recent discussion on the analytics list, we did not yet consider ourselves the proper owners of EventLogging. Our sprint planning is today and I'll bring it up and note its importance in light of this down time.

> Sean proposed moving the EventLogging database to m2, so that it runs on separate hardware from the research databases. I think he's right. I filed <https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the migration.

Thank you, I support isolation.

> Finally, I think EventLogging Icinga alerts should have a higher profile, and possibly page someone. Issues can usually be debugged using the eventloggingctl tool on Vanadium and by inspecting the log files on vanadium:/var/log/upstart/eventlogging-*.

I think this is the key reason the failure was ignored, so I agree here. We should at the very least forward these alerts as an email to analytics devs. I have no idea how to do that, if anyone would like to help that'd be great.

matanya

1:38 p.m.

Regarding your last point, it seems like a mail is sent in this case, to the following users:

define contactgroup {
    contactgroup_name  analytics
    members            dvanliere,ezachte,dtaraborelli,otto,milimetric
}

and the role includes this group in the contacts:

nrpe::monitor_service { 'eventlogging':
    ensure        => 'present',
    description   => 'Check status of defined EventLogging jobs',
    nrpe_command  => '/usr/lib/nagios/plugins/check_eventlogging_jobs',
    require       => File['/usr/lib/nagios/plugins/check_eventlogging_jobs'],
    contact_group => 'admins,analytics',
}
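For context, a check like this one follows the usual Nagios plugin convention: print a single status line and exit with 0 for OK, 1 for WARNING, 2 for CRITICAL or 3 for UNKNOWN. Below is a minimal sketch of that convention. It is not the code of the real check_eventlogging_jobs plugin; the `job_states` mapping, the function names and the OK message text are all made up for illustration.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_jobs(job_states):
    """Return (exit_code, status_line) for a mapping of job name -> running?

    `job_states` is a hypothetical input, e.g.
    {"consumer/mysql-db1047": False}; a real plugin would have to ask the
    init system which jobs are running.
    """
    stopped = sorted(name for name, running in job_states.items() if not running)
    if stopped:
        return CRITICAL, "CRITICAL: Stopped EventLogging jobs: " + " ".join(stopped)
    return OK, "OK: all defined EventLogging jobs are running"


def main(job_states):
    """Print the status line and exit with the matching code."""
    code, line = check_jobs(job_states)
    print(line)
    sys.exit(code)
```

The monitoring server only looks at the exit code and the first line of output, so who gets notified is decided entirely by the contact_group on the service definition, not by the plugin.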

If someone is missing from this list or the check needs to be added to another service, I'll be glad to do it.

Matanya

Dan Andreescu

2:18 p.m.

On Thu, Mar 20, 2014 at 8:38 AM, matanya matanya@foss.co.il wrote:

> Regarding your last point, it seems like a mail is sent in this case, to the following users:
>
> define contactgroup {
>     contactgroup_name  analytics
>     members            dvanliere,ezachte,dtaraborelli,otto,milimetric
> }

Thanks for the help, matanya. So dvanliere and ezachte can be removed from that list as dvanliere no longer works with us and ezachte is not connected to EventLogging at all. Now, I did not know I was on that list and was accidentally filtering the alert to my trash, so I apologize for that. I have updated my rules to make this an important message. I promise to respond to it as promptly as I can in the future.

> If someone is missing from this list or the check needs to be added to another service, i'll be glad to do it.

I think we should also add nuria. I'll also follow up with Nuria and Ori separately to make sure those pinged by the alert can actually do something about it.

Andrew Otto

2:55 p.m.

This is not an eventlogging alert list, but a general analytics alert list. Anyone in analytics should probably be on it.

Nuria Ruiz

2:57 p.m.

> This is not an eventlogging alert list, but a general analytics alert list. Anyone in analytics should probably be on it.

+1

Dan Andreescu

2:59 p.m.

>> This is not an eventlogging alert list, but a general analytics alert list. Anyone in analytics should probably be on it.
>
> +1

Oh yeah, in that case totally

Dan Andreescu

3:29 p.m.

>> This is not an eventlogging alert list, but a general analytics alert list. Anyone in analytics should probably be on it.
>
> +1

Oh yeah, in that case definitely please add anyone in analytics dev - nuria, qchris, spetrea, csalvia

Daniel Zahn

3:49 p.m.

On Thu, Mar 20, 2014 at 7:29 AM, Dan Andreescu dandreescu@wikimedia.org wrote:

> Oh yeah, in that case definitely please add anyone in analytics dev - nuria, qchris, spetrea, csalvia

Hi,

So for that you needed two things. The first is contacts defined in the private repository (we do this to keep phone numbers private).

dandreescu already existed but was not in the analytics group.

nuria, qchris, spetrea and csalvia had to be created. I just did that. You are mail (not paging) contacts for now and you get it 24/7 (as usual when it's just mail), for all the "normal" events like CRITICAL, RECOVERY...

You can see the details below.

The second part is then adding those contacts to your analytics group. That part is in the public repository, and the patch is here:

https://gerrit.wikimedia.org/r/#/c/119753/1

Daniel

---- <snip> ----

define contact{
    contact_name                    csalvia
    alias                           Charles Salvia
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r,f
    service_notification_options    c,r,f
    email                           csalvia@wikimedia.org
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
}

define contact{
    contact_name                    nuria
    alias                           Nuria Ruiz
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r,f
    service_notification_options    c,r,f
    email                           nuria@wikimedia.org
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
}

define contact{
    contact_name                    qchris
    alias                           Christian Aistleitner
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r,f
    service_notification_options    c,r,f
    email                           caistleitner@wikimedia.org
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
}

define contact{
    contact_name                    spetrea
    alias                           Stefan Petrea
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r,f
    service_notification_options    c,r,f
    email                           spetrea@wikimedia.org
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
}

define contact{
    contact_name                    dandreescu
    alias                           Dan Andreescu
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r,f
    service_notification_options    c,r,f
    email                           dandreescu@wikimedia.org
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
}
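In case the one-letter notification options are unfamiliar, here is a small sketch that spells them out, using the letter meanings from the standard Nagios contact object definition. The table and function names are made up for this illustration and are not part of any Nagios or Icinga tooling.

```python
# Letter codes as documented for Nagios contact definitions.
HOST_OPTIONS = {
    "d": "DOWN", "u": "UNREACHABLE", "r": "RECOVERY",
    "f": "FLAPPING", "s": "SCHEDULED DOWNTIME", "n": "NONE",
}
SERVICE_OPTIONS = {
    "w": "WARNING", "u": "UNKNOWN", "c": "CRITICAL", "r": "RECOVERY",
    "f": "FLAPPING", "s": "SCHEDULED DOWNTIME", "n": "NONE",
}


def describe(options, table):
    """Expand a comma-separated option string such as 'c,r,f' into names."""
    return [table[flag.strip()] for flag in options.split(",")]


host_events = describe("d,r,f", HOST_OPTIONS)
service_events = describe("c,r,f", SERVICE_OPTIONS)
```

Read this way, the contacts above get mail when a host goes down, when a service goes critical, on recoveries and on flapping, but not for service WARNING or UNKNOWN states.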

-- Daniel Zahn dzahn@wikimedia.org Operations Engineer

Aaron Halfaker

4:18 p.m.

FYI: Just filed an RT for Dario, Leila, Oliver and myself to be notified as well.

Aaron Halfaker

3:52 p.m.

We should probably add the researchers to this alert list too since we tend to be the ones causing the alerts. I'd rather be able to fix my own problems than have a dev track me down to figure out what I'm working on.

Christian Aistleitner

8 p.m.

Hi,

On Thu, Mar 20, 2014 at 10:29:38AM -0400, Dan Andreescu wrote:

> Oh yeah, in that case definitely please add anyone in analytics dev - [...], qchris, [...]

I know it happened with best intentions, but please do not add me to lists/alerts without checking back with me.

I hope https://gerrit.wikimedia.org/r/119796 should remove me again.

Best regards, Christian

-- ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Gruendbergstrasze 65a Email: christian@quelltextlich.at 4040 Linz, Austria Phone: +43 732 / 26 95 63 Fax: +43 732 / 26 95 63 Homepage: http://quelltextlich.at/ ---------------------------------------------------------------

Aaron Halfaker

2:59 p.m.

*re. moving EventLogging to m2*

If we do that, we'll need to also set up a process for copying new events to db1047 so that we can continue to join EL data against enwiki. This is critical for our research.

Dario Taraborelli

3:31 p.m.

definitely, segregating EL data from MediaWiki databases will significantly affect our ability to do research.

Sean Pringle

21 Mar

3 a.m.

On Thu, Mar 20, 2014 at 11:59 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:

> *re. moving EventLogging to m2*
>
> If we do that, we'll need to also set up a process for copying new events to db1047 so that we can continue to join EL data against enwiki. This is critical for our research.

Noted.

Sean

-- DBA @ WMF

Greg Grossmeier

20 Mar

9:20 p.m.

> At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming connections. At some point during the subsequent hour, MariaDB had either crashed or been manually restarted. Sean noticed that the database was choking on some queries from the researchers and notified the wmfresearch list.

Can someone from Analytics own this post-mortem and put it on the wiki: https://wikitech.wikimedia.org/wiki/Incident_documentation

Please add specific next steps (with bug#, RT#s, or gerrit urls), even (especially) things you haven't done yet and are just "nice to have".

Thanks,

Greg

-- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

Dan Andreescu

11:40 p.m.

> <quote name="Ori Livneh" date="2014-03-20" time="03:52:01 -0700">
>> At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming connections. At some point during the subsequent hour, MariaDB had either crashed or been manually restarted. Sean noticed that the database was choking on some queries from the researchers and notified the wmfresearch list.
>
> Can someone from Analytics own this post-mortem and put it on the wiki: https://wikitech.wikimedia.org/wiki/Incident_documentation
>
> Please add specific next steps (with bug#, RT#s, or gerrit urls), even (especially) things you haven't done yet and are just "nice to have".

I think the "switch over" to this being owned by the Analytics team has not been officially done yet. We're going to talk about it early next week. Until then, I'd like to defer this task as we have to meet our other commitments.

Christian Aistleitner

27 Mar

6:58 p.m.

Hi Analytics Dev team,

On Thu, Mar 20, 2014 at 01:20:54PM -0700, Greg Grossmeier wrote:

> <quote name="Ori Livneh" date="2014-03-20" time="03:52:01 -0700">
>> At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming connections. At some point during the subsequent hour, MariaDB had either crashed or been manually restarted. Sean noticed that the database was choking on some queries from the researchers and notified the wmfresearch list.
>
> Can someone from Analytics own this post-mortem and put it on the wiki: https://wikitech.wikimedia.org/wiki/Incident_documentation
>
> Please add specific next steps (with bug#, RT#s, or gerrit urls), even (especially) things you haven't done yet and are just "nice to have".

it's been a week, and I cannot find the post-mortem Greg requested at the above URL :-/

Neither did I see a response from our team to Greg's email.

I lost track of our EventLogging responsibilities during the recent back and forth. So:

Toby, are we actually grabbing Greg's item or are we pushing back on it?

Best regards, Christian

P.S.: Toby, if we're grabbing it: I totally lack knowledge about both EventLogging, and the incident itself. So, be prepared for double slow start if I get to work on it.

Christian Aistleitner

3 Apr

3:27 p.m.

Hi Toby,

and zooooooooom ... there goes another week without us even deciding whether or not we feel responsible doing the incident documentation and follow-up work. :-D

I feel somewhat embarrassed that after two weeks, and after the ping on mailing lists, we still did not yet manage to tell Greg at least whether or not we'll work on it.

So,--if you do not chime in/push back by then--I'll be bold and I'll consider our given lip service around EventLogging a commitment and start working on it on Monday (2014-04-07).

Best regards, Christian

Toby Negrin

4 Apr

3:51 a.m.

Hi all,

Christian -- thanks for following up on this.

I've created a ticket[1] for this issue as a production issue. Kevin -- please triage tomorrow in standup. We can own the actual incident report but we'll need to get some help from Ori in understanding how to perform the post mortem.

The current status for EventLogging support is that Ori, the Analytics team, the Operations team and the Platform team are discussing the handover of EventLogging. The Analytics team will own EventLogging as soon as we can, but we need to get consensus on the details.

I've written up our discussions on this wiki page[2]. Please feel free to add/discuss. We've had some preliminary discussions with Andrew Otto but need to follow up with Rob and Ori.

-Toby

[1] https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/1526
[2] https://www.mediawiki.org/wiki/Analytics/EventLogging

Christian Aistleitner

22 Apr

1:06 p.m.

Hi Toby,

[ moving Ops to Bcc, as it seems to have become analytics specific ]

On Thu, Apr 03, 2014 at 06:51:20PM -0700, Toby Negrin wrote:

> The Analytics team will own EventLogging as soon as we can, but we need to get consensus on the details. [...] Please feel free to add/discuss. We've had some preliminary discussions with Andrew Otto but need to follow up with Rob and Ori.

as the EventLogging transition page at

https://www.mediawiki.org/wiki/Analytics/EventLogging

basically came to a rest in the past two weeks, and is a bit scarce on concrete assignment to teams and dates ... what is the current agreement on EventLogging alert handling?

Icinga for example monitors whether or not the EventLogging jobs are running. If I read puppet correctly, Icinga (in case of problems) alerts the ops IRC channel, and sends email to the analytics contact group [1].

Toby, just to be explicit and meet IIDHOTMLIDH [2], does that mean whoever from Icinga's analytics contact group first sees an alert is expected to take ownership and act on it?

Toby, since it also has been suggested by some to page people in case of EventLogging issues, do we really want to do that?

Best regards, Christian

[1] https://git.wikimedia.org/blob/operations%2Fpuppet.git/14357072e8d15e00a8a47...

[2] If it didn't happen on the mailing list, it didn't happen.

Christian Aistleitner

30 Apr

9:51 p.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

Hi Toby,

ping.

Have fun, Christian

Christian Aistleitner

14 May

12:58 p.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

Hi Toby,

ping.

Have fun, Christian

Christian Aistleitner

30 May

3:48 p.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

Hi Toby,

the EventLogging status never made it to the lists. However, judging from

https://www.mediawiki.org/w/index.php?title=Analytics/EventLogging&diff=...

and from the Backlog section that was added to the page, it looks like the ownership migration is officially happening.

Which team will be responsible for which parts, and with what expectations?

Have fun, Christian

Faidon Liambotis

20 Mar

9:47 p.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:

...

The Upstart job for EventLogging is configured to re-spawn the writer, up to a certain threshold of failures. Because the writer repeatedly failed to connect, it hit the threshold, and was not re-spawned.

This sounds like a bug. A temporary issue (database unavailability, for whatever reason) resulting in a permanent crash of the service needing manual action to restore. This needs to be fixed.
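One way to keep a temporary outage from becoming a permanent stop is to have the writer retry the connection with a capped exponential backoff instead of exiting on the first failure. The following is only a minimal sketch of that idea, not EventLogging's actual code: `connect_with_backoff`, its parameters, and the `connect` callable are all hypothetical names introduced for illustration.

```python
import time


def connect_with_backoff(connect, base_delay=1.0, max_delay=60.0,
                         max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    `connect` is any zero-argument callable that returns a connection or
    raises an exception (hypothetical; a real writer would wrap its own
    database connection call here).
    """
    attempt = 0
    delay = base_delay
    while True:
        attempt += 1
        try:
            return connect()
        except Exception:
            # Assumption: every exception is treated as a transient failure.
            # A real writer would catch only its driver's connection errors.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

With `max_attempts=None` the process keeps retrying for as long as the database is unavailable, so it never reaches the supervisor's respawn threshold because of an outage alone. Whether unbounded retrying is the right policy here is a design decision for whoever owns the service.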

...

This alert was not responded to. I finally got pinged by Tillman, who noticed the blog visitor stats report was blank, and by Gilles, who noticed image loading performance data was missing.

We have to fix this. The level of maintenance that EventLogging gets is not proportional to its usage across the organization. Analytics, I really need you to step up your involvement.

I can't comment on the general involvement of analytics in this area, but I do think that responding to Icinga alerts is primarily a techops responsibility. We can and should escalate as necessary, and it's obviously always nice and appreciated to see non-ops people lurking in #wikimedia-operations and jumping in on failures, but I don't think I'd blame anyone else for not reacting to an alert. Especially in this case, as anyone could have concluded after a trivial investigation that a simple restart of the Upstart job would fix this (AIUI).

...

Finally, I think EventLogging Icinga alerts should have a higher profile, and possibly page someone. Issues can usually be debugged using the eventloggingctl tool on Vanadium and by inspecting the log files on vanadium:/var/log/upstart/eventlogging-*.

We generally try to keep paging to a minimum. First, for our personal sanities :), but more importantly, because if your phone keeps beeping all day, you become accustomed to it and it will become easier to ignore a "site is down" alert.

IMO, pages are for very serious alerts. That doesn't mean that the other (CRITICAL but non-paging) alerts are meant to be ignored for days. In my experience, I see very few opsens actively monitor the Icinga unhandled services page (let alone fix random issues or even their own issues as they see them) and I think we can do better than that.

I personally check that page several times within my day, as well as the IRC log, but I do wonder what others do or how they feel about this, especially as we've agreed to scale up the amount of checks (and hence alerts) that we have.

Faidon

Toby Negrin

11:49 p.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

We will work with Ori to understand what level of effort is required to support EventLogging. It's likely that Analytics and techops (and Ori) will need to collaborate on what will need to be done.

Faidon -- I would pull in Andrew, but I'm really concerned about his workload with the many tasks that need to be done to productize Kafka/Hadoop. Can you identify another resource who might be able to help (to set up or configure monitoring, for example)?

-Toby

Ori Livneh

21 Mar

6:35 a.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

On Thu, Mar 20, 2014 at 3:49 PM, Toby Negrin tnegrin@wikimedia.org wrote:

...

We will work with Ori to understand what level of effort is required to support EventLogging. It's likely that Analytics and techops (and Ori) will need to collaborate on what will need to be done.

* The Ganglia scripts need to be fixed.
* A daily report should go out reporting the number of valid and invalid events logged, broken down by schema.
* Someone needs to scan that report for anything unusual, file bugs for code that violates its data model, and follow up with the relevant team to ensure a fix.
* Alerts need to be responded to.
* Once a month, the backup process (vanadium -> stat1001 -> tridge) should get a quick lookover to ensure that it is functioning.
* Once every six months, a drill should be conducted to test system failover and recovery procedures.
* There should be a designated person to provide technical advice and Gerrit code review for new instrumentation code. (This has already scaled beyond just me -- folks like Matt F, Yuvi, Jon, Bryan, etc. have the requisite expertise. But someone needs to own this, and be accountable that code review happens in a prompt fashion.)
* Bugs reported in Bugzilla should be acknowledged and resolved.
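The daily report item could be sketched roughly as follows. This is an illustration only: it assumes events arrive as `(schema_name, is_valid)` pairs, which is not how EventLogging necessarily exposes them, and the function names and schema names (`SchemaA`, `SchemaB`) are hypothetical.

```python
from collections import Counter


def summarize_events(events):
    """Count valid and invalid events per schema.

    `events` is an iterable of (schema_name, is_valid) pairs; this input
    shape is an assumption made for the sketch.
    """
    counts = {}
    for schema, is_valid in events:
        per_schema = counts.setdefault(schema, Counter())
        per_schema["valid" if is_valid else "invalid"] += 1
    return counts


def format_report(counts):
    """Render the per-schema counts as tab-separated text, sorted by schema."""
    lines = ["schema\tvalid\tinvalid"]
    for schema in sorted(counts):
        c = counts[schema]
        # A Counter returns 0 for keys that were never incremented.
        lines.append(f"{schema}\t{c['valid']}\t{c['invalid']}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("SchemaA", True), ("SchemaA", False), ("SchemaB", True)]
    print(format_report(summarize_events(sample)))
```

A schema whose invalid count suddenly jumps would be the kind of anomaly the person scanning the report is looking for.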

Toby, I think you guys have the requisite talent and capacity to handle it internally.

Aaron Halfaker

3:31 p.m.

New subject: [Ops] EventLogging postmortem, and maintenance responsibilities

Thanks Ori for pushing us on this. EventLogging is one of my primary tools for getting things done, so it's very important to me that the system is well supported.

analytics@lists.wikimedia.org

27 comments

13 participants

participants (13)

Aaron Halfaker
Andrew Otto
Christian Aistleitner
Dan Andreescu
Daniel Zahn
Dario Taraborelli
Faidon Liambotis
Greg Grossmeier
matanya
Nuria Ruiz
Ori Livneh
Sean Pringle
Toby Negrin