On Tue, Aug 4, 2015 at 4:27 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:

> On 4 August 2015 at 04:24, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
>> Oliver Keyes, 04/08/2015 00:12:
>>>
>>> a lot less cautious about our sampling
>>> rate!
>>
>>
>> A bit, perhaps, not a lot. Sampling is not just a performance matter.
>>
>
> Could you expand on that?

Not to speak for Nemo, but we don't want reckless abandon just because the system won't break. Thrift is one of our values at the Foundation, and if we don't need to scale out with more hardware, we shouldn't. So I think that if the data collected has value beyond the cost of wrangling it through Kafka / HDFS / dashboards, then it should be collected. If not, it should be sampled down until it does. There may not be an easy way to measure this, so we'll have to rely on good old subjective consensus. I promise we won't be too strict about it; we'll just kindly ask people to think twice before collecting a lot of data.