[Adding some other mailing lists in Cc]
Hi everybody,
as many of you probably noticed yesterday on the operations@ mailing list, we had an outage of the Kafka Main eqiad cluster that forced us to switch the Eventbus and Eventstreams services to codfw.
The precise timings will be listed in https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eq..., but here is a quick summary:
2018-07-11 17:00 UTC - Eventbus service switched to codfw
2018-07-11 18:44 UTC - Eventstreams service switched to codfw
We are going to switch those services back to eqiad within the next couple of hours. Consumers of the Eventstreams service may see some failures or data drops; apologies in advance for the trouble.
Cheers,
Luca
On Thu, 12 Jul 2018 at 00:00, Luca Toscano <ltoscano@wikimedia.org> wrote:
Hi everybody,
as you might have seen in the operations channel on IRC, the Kafka Main eqiad cluster (kafka100[1-3].eqiad.wmnet) suffered a long outage caused by new topics pushed out with names that were too long (causing filesystem operation issues, etc.). I'll update this email thread tomorrow, EU time, with more details, tasks, the precise root cause, etc., but the important thing to know is that Eventbus and Eventstreams have been failed over to the Kafka Main codfw cluster. This should be transparent to everybody, but please let us know if it is not.
Thanks for your patience!
(a very sleepy :) Luca