Unexpected downtime for ORES and unscheduled deployment right now - AI

25 Sep 2016


Today ORES in production was sending out an unreasonable number of timeout
errors, causing Icinga to scream and a 14% failure rate on average for ORES
review tool jobs. It turned out that the ORES workers were logging too much,
causing the nodes to run out of disk space. [1] I suspect we had a similar
issue on our labs nodes.
I made changes for prod and labs and deployed them today. You can find more
details in the Phab card.
[1]: https://phabricator.wikimedia.org/T146581
Cheers
Best