We will work with Ori to understand what level of effort is required to
support EventLogging. It's likely that Analytics and techops (and Ori) will
need to collaborate on what will need to be done.
Faidon -- I would pull in Andrew, but I'm really concerned about his
workload with the many tasks that need to be done to productize
Kafka/Hadoop. Can you identify another resource who might be able to help
(set up/configure monitoring for example)
On Thu, Mar 20, 2014 at 1:47 PM, Faidon Liambotis <faidon@wikimedia.org> wrote:
On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:
> The Upstart job for EventLogging is configured to re-spawn the writer, up
> to a certain threshold of failures. Because the writer repeatedly failed
> to connect, it hit the threshold, and was not re-spawned.
This sounds like a bug. A temporary issue (database unavailability, for
whatever reason) resulted in a permanent crash of the service, needing
manual action to restore. This needs to be fixed.
> This alert was not responded to. I finally got pinged by Tillman, who
> noticed the blog visitor stats report was blank, and by Gilles, who
> noticed that image loading performance data was missing.
> We have to fix this. The level of maintenance that EventLogging gets is
> proportional to its usage across the organization. Analytics, I really
> need you to step up your involvement.
I can't comment on the general involvement of analytics in this area,
but I do think that responding to Icinga alerts is primarily a techops
responsibility. We can and should escalate as necessary, and it's
obviously always nice & appreciated to see non-ops people lurking around
in #wikimedia-operations and jumping in on failures, but I don't think
I'd blame anyone else for not reacting to an alert. Especially in this
case, anyone doing even a trivial investigation could have come to the
conclusion that a simple restart of the Upstart job would fix it.
> Finally, I think EventLogging Icinga alerts should have a higher profile,
> and possibly page someone. Issues can usually be debugged using the
> eventloggingctl tool on Vanadium and by inspecting the log files on
We generally try to keep paging to a minimum. First, for our personal
sanities :), but more importantly, because if your phone keeps beeping
all day, you become accustomed to it and it will become easier to ignore
a "site is down" alert.
IMO, pages are for very serious alerts. That doesn't mean that the other
(CRITICAL but non-paging) alerts are meant to be ignored for days. In my
experience, very few opsens actively monitor the Icinga unhandled
services page (let alone fix random issues, or even their own, as they
see them), and I think we can do better than that.
I personally check that page several times within my day, as well as the
IRC log, but I do wonder what others do or how they feel about this,
especially as we've agreed to scale up the amount of checks (and hence
alerts) that we have.
Ops mailing list