Hi,
with the recent events around EventLogging, I think a high-level round-up of what happened is overdue.
There have been four unrelated issues:
* Failed test to have EventLogging access its database through a high-availability proxy
* Event volume growing beyond what EventLogging's database writer could handle
* db1020 outage
* Untested EventLogging code got accidentally deployed
(Please find the details below.)
For the last three of the four issues, backfilling is still pending, because the focus so far has been on getting EventLogging under control again. Now that EventLogging seems to be under control again (keeping fingers crossed for the second item in the above list), backfilling is next.
Sorry for the inconvenience,
Christian
* Failed test to have EventLogging access its database through a high-availability proxy
Production's firewall got in the way [1]. Data got backfilled in the database from the logs. So no data got lost.
The switch to a high-availability proxy happened in the meantime.
* Event volume growing beyond what EventLogging's database writer could handle
Event volume increased by almost 60% overnight, and the database writer could not handle the increased volume [2].
The database writer was restructured and deployed yesterday evening (UTC). Since then, the restructured database writer has easily handled the increased volume. But before declaring victory, we have to wait a few days and see how it handles the hours of increased activity.
During the time that the database writer could not handle the event volume, logging to disk could keep up with the increased volume, so backfilling should work, but it is still pending.
* db1020 outage
The database process of the m2 cluster (the one which EventLogging's database writer writes to) died [3]. Ops handled the issue promptly and failed over to a slave database.
We have logs in plain files for the affected periods. So backfilling should work, but it is still pending.
* Untested EventLogging code got accidentally deployed
It seems that, while trying to fix EventLogging for beta, an untested version accidentally got deployed to production. This version intermittently stopped writing to the database and then started working again [4].
A working version has been deployed again.
We have logs in plain files for the affected periods. So backfilling should work, but is still pending.
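For readers wondering what backfilling from the plain log files amounts to, here is a minimal sketch. It is not the actual EventLogging tooling: it assumes each log line is a JSON object with hypothetical `uuid` and `timestamp` fields, and it uses an in-memory SQLite table named `events` as a stand-in for the real database. The idea is only that replaying the logs with a unique key per event inserts what is missing and skips what is already there, so a backfill run can be repeated safely.

```python
import json
import sqlite3


def backfill(conn, log_lines):
    """Replay log lines into the (stand-in) events table.

    Events already present (same uuid) are skipped.
    Returns the number of rows actually inserted.
    """
    # Hypothetical table layout; the uuid primary key makes replays idempotent.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "uuid TEXT PRIMARY KEY, timestamp TEXT, payload TEXT)"
    )
    before = conn.total_changes
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the log file
        event = json.loads(line)
        conn.execute(
            "INSERT OR IGNORE INTO events (uuid, timestamp, payload) "
            "VALUES (?, ?, ?)",
            (event["uuid"], event["timestamp"], json.dumps(event)),
        )
    conn.commit()
    # total_changes only counts rows that were really inserted.
    return conn.total_changes - before
```

Running it twice over the same log lines inserts nothing the second time, which is the property that makes it safe to backfill periods that were only partially written.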
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLog...
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLog...
[3] https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html
[4] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLog...
Hi,
On Fri, Nov 21, 2014 at 02:22:47PM +0100, Christian Aistleitner wrote:
> For the last three of the four issues, backfilling is still pending, [...]
Backfilling the database just finished (without issues).
Numbers in the various EventLogging-powered dashboards should jump up again after each dashboard's next regeneration run (i.e., within 24 hours for most dashboards).
Have fun, Christian
P.S.: Since I received emails asking how far off the dashboards are/have been, I attached
EventLogging-backfilling-2014-11.png
which shows how much each hour was off for the NavigationTiming schema.
Hi,
On Mon, Nov 24, 2014 at 03:32:48PM +0100, Christian Aistleitner wrote:
> On Fri, Nov 21, 2014 at 02:22:47PM +0100, Christian Aistleitner wrote:
> > For the last three of the four issues, backfilling is still pending, [...]
>
> Backfilling the database just finished (without issues).
I forgot to mention that the implemented fix proved effective and survived the hours of high activity without issues. So events are getting written reliably to the database again.
Have fun, Christian
Awesome -- nice work all. Thanks for the updates Christian.
-Toby
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Thankee! Springle was asking wth was going on with the multiple EL consumers on stat2, which I assume is this - is he on this list?
Thanks! It's really helpful to see a graph. :)
-- Oliver Keyes Research Analyst Wikimedia Foundation
Hi,
On Mon, Nov 24, 2014 at 10:19:44AM -0500, Oliver Keyes wrote:
> Thankee! Springle was asking wth was going on with the multiple EL consumers on stat2, which I assume is this - is he on this list?
Ouch. It seems the heads-up to springle did not make it out of my Outbox :-(
I'll follow up with him off-list.
Thanks, Christian