Hi all,
EventStreams just experienced a 24-hour ‘outage’. No messages were
dropped, but for about 24 hours none were sent to connected
EventStreams clients.
I’ve written up the Incident Report here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20170829-EventSt…
The worst part about this is that we didn’t know that there was a problem
until a user notified me on IRC. We monitor and alert on pieces of
EventStreams infrastructure, but not on topic volume, because it varies
and alerting on it is hard to get right. Still, it shouldn’t have taken
24 hours and a user report for us (me) to notice, so I’ve created
https://phabricator.wikimedia.org/T174493 to help us catch something like
this in the future.
Apologies if this caused any inconvenience.
-Andrew Otto
Systems Engineer, Wikimedia Foundation