On Thu, Mar 20, 2014 at 3:49 PM, Toby Negrin <tnegrin@wikimedia.org> wrote:
We will work with Ori to understand what level of effort is required to support EventLogging. It's likely that Analytics and techops (and Ori) will need to collaborate on what will need to be done.

* The Ganglia scripts need to be fixed.
* A daily report should go out reporting the number of valid and invalid events logged, broken down by schema.
* Someone needs to scan that report for anything unusual, file bugs for code that violates its data model, and follow up with the relevant team to ensure a fix.
* Alerts need to be responded to.
* Once a month, the backup process (vanadium -> stat1001 -> tridge) should get a quick check to ensure that it is functioning.
* Once every six months, a drill should be conducted to test system failover and recovery procedures.
* There should be a designated person to provide technical advice and Gerrit code review for new instrumentation code. (This has already scaled beyond just me -- folks like Matt F, Yuvi, Jon, Bryan, etc. have the requisite expertise. But someone needs to own this, and be accountable for code review happening promptly.)
* Bugs reported in Bugzilla should be acknowledged and resolved.
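
To make the daily report item concrete, here is a rough sketch of the tally it would need. It assumes the events have already been parsed into (schema name, is_valid) pairs; that input shape, the function names, and the sample schema names are all hypothetical and are not taken from the EventLogging code.

```python
def summarize(events):
    """Count valid and invalid events per schema.

    `events` is an iterable of (schema_name, is_valid) pairs. This input
    shape is an assumption for illustration, not EventLogging's log format.
    """
    counts = {}
    for schema, is_valid in events:
        valid, invalid = counts.get(schema, (0, 0))
        if is_valid:
            valid += 1
        else:
            invalid += 1
        counts[schema] = (valid, invalid)
    return counts


def format_report(counts):
    """Render the per-schema counts as a plain-text table, sorted by schema."""
    row = "{:<30}{:>10}{:>10}"
    lines = [row.format("schema", "valid", "invalid")]
    for schema in sorted(counts):
        valid, invalid = counts[schema]
        lines.append(row.format(schema, valid, invalid))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data; the schema names are placeholders.
    sample = [
        ("ExampleSchemaA", True),
        ("ExampleSchemaA", False),
        ("ExampleSchemaB", True),
    ]
    print(format_report(summarize(sample)))
```

A schema whose invalid count is suddenly nonzero is the kind of thing the person scanning the report would want to chase down.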

Toby, I think you guys have the requisite talent and capacity to handle it internally.