Hi Dario,
On Thu, Dec 11, 2014 at 04:11:49PM -0800, Dario Taraborelli wrote:
> I am kicking off this thread [...]
Thanks!
> However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).
Full ACK.
However, that sounds like we're only talking about schemas where the collection code got tested using Vagrant or beta, and is known to work on the relevant portion of the traffic.
And since you say that it's on browsers/platforms that we don't necessarily test for internally, I assume we're actually talking only about a small fraction of the traffic.
I assume that scope for the rest of the reply.
> is there a way to inspect invalid events in near real time without having access to vanadium?
* Urgent, ad-hoc needs
For urgent, ad-hoc needs (which should be rare, given the scope), ping us on IRC in #wikimedia-analytics. At least qchris, milimetric, and nuria can ssh into vanadium and take a look right away.
If none of them are around, Ops of course have access to the relevant files on vanadium [1]. And since we're in the case of urgent, ad-hoc needs, I am sure they'd help out.
* Not so urgent needs
For not so urgent needs, since it's only a small fraction of the traffic, I am not sure near real-time access is worth the effort.
Sure, it would be nice to provide near real-time access to those files, but we should also get the cluster into a more reliable state, implement UDFs for researchers to make their lives easier, and get the data warehouse up and running ;-)
But I see that a Phabricator task got added in the meantime, so I guess I am alone in my judgement :-)
Have fun, Christian
[1] Either
/srv/log/eventlogging/client-side-events.log
or
/srv/log/eventlogging/server-side-events.log
depending on the kind of event you're looking for.