On Tue, Aug 4, 2015 at 4:27 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> On 4 August 2015 at 04:24, Federico Leva (Nemo)
>> Oliver Keyes, 04/08/2015 00:12:
>>> a lot less cautious about our sampling
>> A bit, perhaps, not a lot. Sampling is not just a performance matter.
> Could you expand on that?
Not to speak for Nemo, but we don't want reckless abandon just because the
system won't break. Thrift is one of our values at the Foundation, and if
we don't need to scale out with more hardware, we shouldn't. So I think if
data collected has value beyond the cost to wrangle it through Kafka / HDFS
/ dashboards, then it should be collected. If not, it should be sampled
down until its value does exceed that cost. There may not be an easy way to
measure this, so we'll have to rely on good old subjective consensus. I
promise we won't be too strict about it; we'll just kindly ask people to
think twice before collecting a lot of data.