Hi all,

EventStreams just experienced a 24-hour ‘outage’.  No messages were dropped, but for about 24 hours no messages were sent to connected EventStreams clients.

I’ve written up the Incident Report here:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20170829-EventStreams

The worst part is that we didn’t know there was a problem until a user notified me on IRC.  We monitor and alert on pieces of the EventStreams infrastructure, but we don’t monitor topic volume, because it varies and alerting on it is hard to get right.  Still, it shouldn’t have taken 24 hours and a user report for us (me) to notice, so I’ve created https://phabricator.wikimedia.org/T174493 to help us catch something like this in the future.

Apologies if this caused any inconvenience.

-Andrew Otto
 Systems Engineer, Wikimedia Foundation