Hi,
With the recent events around EventLogging, I think a high-level
round-up of what happened is overdue.
There have been four unrelated issues:
* Failed test to have EventLogging access its database through a high-availability proxy
* Event volume growing beyond what EventLogging's database writer could handle
* db1020 outage
* Untested EventLogging code got accidentally deployed
(Please find the details below)
For the last three of the four issues, backfilling is still pending,
as the focus so far has been on getting EventLogging under control
again. Now that EventLogging seems to be under control again (fingers
crossed for the second item in the list above), backfilling is next.
Sorry for the inconvenience,
Christian
* Failed test to have EventLogging access its database through a high-availability proxy
Production's firewall got in the way [1].
Data has been backfilled into the database from the logs, so no data
was lost.
The switch to a high-availability proxy happened in the meantime.
* Event volume growing beyond what EventLogging's database writer could handle
Event volume increased by just under 60% overnight, and the database
writer could not handle the increased volume [2].
The database writer was restructured and deployed yesterday evening
(UTC). Since then, the restructured database writer has easily handled
the increased volume. But before declaring victory, we have to wait a
few days and see how it handles hours of increased activity.
During the time that the database writer could not handle the event
volume, logging to disk could keep up with the increased volume, so
backfilling should work, but it is still pending.
* db1020 outage
The database process of the m2 cluster (the one which EventLogging's
database writer writes to) died [3]. Ops handled the issue promptly
and failed over to a slave database.
We have logs in plain files for the affected periods. So backfilling
should work, but it is still pending.
* Untested EventLogging code got accidentally deployed
It seems that, while trying to fix EventLogging for beta, an untested
version was accidentally deployed to production. That version
intermittently stopped writing to the database and then started
working again [4].
A working version has been deployed again.
We have logs in plain files for the affected periods. So backfilling
should work, but it is still pending.
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLo…
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLo…
[3] https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html
[4] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLo…
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage:
http://quelltextlich.at/
---------------------------------------------------------------