EventStreams just experienced a 24-hour ‘outage’. There were no dropped
messages, but for about 24 hours no messages were sent to connected clients.
I’ve written up the Incident Report here:
The worst part about this is that we didn’t know that there was a problem
until a user notified me on IRC. We monitor and alert on pieces of
EventStreams infrastructure, but don’t monitor topic volume, as it varies
and is hard to get right. However, this shouldn’t have taken 24 hours and
a user for us (me) to notice, so I’ve created
to help us catch something like
this in the future.
Apologies if this caused any inconvenience.
Systems Engineer, Wikimedia Foundation