Thanks Ori for pushing us on this. EventLogging is one of my primary tools
for getting things done, so it's very important to me that the system is
well supported.
On Fri, Mar 21, 2014 at 12:35 AM, Ori Livneh <ori(a)wikimedia.org> wrote:
On Thu, Mar 20, 2014 at 3:49 PM, Toby Negrin <tnegrin(a)wikimedia.org> wrote:
We will work with Ori to understand what level of effort is required to
support EventLogging. It's likely that Analytics and techops (and Ori) will
need to collaborate on what will need to be done.
* The Ganglia scripts need to be fixed.
* A daily report should go out reporting the number of valid and invalid
events logged, broken down by schema.
* Someone needs to scan that report for anything unusual, file bugs for code
that violates its data model, and follow up with the relevant team to
ensure a fix.
* Alerts need to be responded to.
* Once a month, the backup process (vanadium -> stat1001 -> tridge) should
get a quick lookover to ensure that it is functioning.
* Once every six months, a drill should be conducted to test system
failover and recovery procedures.
* There should be a designated person to provide technical advice and
Gerrit code review for new instrumentation code. (This has already scaled
beyond just me -- folks like Matt F, Yuvi, Jon, Bryan, etc. have the
requisite expertise. But someone needs to own this, and be accountable that
code review happens in a prompt fashion.)
* Bugs reported in Bugzilla should be acknowledged and resolved.
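
To make the daily-report item above concrete, here is a minimal sketch of the tally it describes: valid and invalid event counts, broken down by schema. It is not taken from the EventLogging codebase. The input shape (an iterable of (schema_name, is_valid) pairs), the function names, and the schema names in the usage note are all assumptions made for illustration.

```python
from collections import Counter


def summarize_events(events):
    """Tally valid and invalid events per schema.

    `events` is assumed to be an iterable of (schema_name, is_valid)
    pairs; this input shape is hypothetical, not EventLogging's own.
    """
    counts = {}
    for schema, is_valid in events:
        row = counts.setdefault(schema, Counter())
        row["valid" if is_valid else "invalid"] += 1
    return counts


def format_report(counts):
    """Render the tallies as a plain-text table, one row per schema."""
    lines = ["%-30s %8s %8s" % ("schema", "valid", "invalid")]
    for schema in sorted(counts):
        row = counts[schema]
        # Counter returns 0 for a key that was never incremented.
        lines.append("%-30s %8d %8d" % (schema, row["valid"], row["invalid"]))
    return "\n".join(lines)
```

Calling format_report(summarize_events([("SchemaA", True), ("SchemaA", False), ("SchemaB", True)])) would yield a header row followed by one row per schema, which is the kind of output someone could scan for anything unusual.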
Toby, I think you guys have the requisite talent and capacity to handle it
internally.
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics