Hi,
just a quick heads-up: since 2015-01-07 ~1:55, fewer than 30% of the EventLogging events are getting written to the database.
It seems a deployment went wrong and validation is no longer working as expected. More than 70% of the messages no longer validate and hence do not get written to the database.
The raw log files (pre-validation) are still getting written, so data is not lost, and backfilling is possible.
Best regards, Christian
[ Adding eventlogging-alerts to CC ]
--
quelltextlich e.U. / Christian Aistleitner
Companies' registry: 360296y in Linz
Kefermarkterstrasze 6a/3, 4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Team:
The EventLogging issues have been resolved. The outage affected client-side events only (server-side events were not affected) and lasted about 12 hours.
Please see: http://picpaste.com/Screen_Shot_2015-01-07_at_10.50.28_AM-NsMSPgHp.png
Thanks,
Nuria
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I talked about this at Scrum of Scrums, and added this image to the notes I just sent out. I said we're leaning towards not backfilling and are willing to be convinced otherwise. We'll see what people say.
Folks -- thanks for owning this. One concern -- this is the second deployment-related problem in the last couple of months. I'm concerned that we need to invest more resources in a testing environment as well as a deployment checklist. I'm also considering having EL added to Greg's deployment calendar (with the accompanying restrictions) since it's approaching being a core service.
Thoughts?
-Toby
I think all that's needed here is a deployment checklist. The problem was that a change was not tested in beta labs before being deployed. We could automate that and then take it off the checklist at a later time. But as long as it happens somehow, I think we should be safe enough.
Who is actually maintaining the EventLogging Extension now? As far as I can tell, none of the members of the Analytics-EventLogging project in Phabricator are developers. This makes it hard to know who to ping when there is a problem. For example, this EL bug that I filed a month ago was never triaged or replied to, and I'm not sure who to poke about it: https://phabricator.wikimedia.org/T78325
Ryan - I'm sorry I was not aware of this. The Analytics team is responsible for Event Logging, and you can ping any of us if we're not paying attention to an issue.
Christian has been largely taking care of EL by himself, and was kept quite busy with Event Logging reliability and the need to backfill lost data. As Christian transitions away from our team, the responsibility falls on the rest of us, and I personally am getting up to speed with it. The bug you mentioned, https://phabricator.wikimedia.org/T78325, sounds like a pain and I'm happy to work on it to learn more about EL. I will bring it up with Kevin and have him respond here if it's *not* a priority.
Kaldari:
Expanding a bit on what Dan said:
We took over EL from Ori about 6 months ago. The operational support that Analytics provides is documented here: https://www.mediawiki.org/wiki/EventLogging/OperationalSupport
EL has several parts. While we have not done much development on the MediaWiki extension, we have done, together with Ori, quite a bit of work on the server side, as otherwise EL could not have scaled to the level it's at right now:
https://github.com/wikimedia/mediawiki-extensions-EventLogging/tree/master/s...
By all means poke us about bugs you feel need more attention.
Thanks,
Nuria
Hey Ryan, I put this bug on our agenda for our tasking meeting so we can scope it out and decide if we can commit to accomplishing it in the next sprint.
Incident documentation updated: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150107-EventLog...