> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored.

Aye, Jeff mentioned maybe doing that. Not a bad idea.
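A minimal sketch of what that canary check could look like, assuming canary messages carry their send timestamp and are produced on a fixed interval (function name, interval, and tolerance are all hypothetical, not an existing script):

```python
import time

def check_canaries(received, interval_s=60, tolerance_s=30, now=None):
    """received: sorted list of (send_ts, recv_ts) pairs for canary
    messages produced every interval_s seconds. Returns per-message
    delivery latencies and a list of alerts."""
    now = now if now is not None else time.time()
    # Delivery latency per canary -- the "measure delivery latency"
    # side benefit Toby mentions.
    latencies = [recv - sent for sent, recv in received]
    alerts = []
    # A hole between consecutive canaries wider than the send interval
    # plus tolerance means at least one canary never arrived.
    for (s1, _), (s2, _) in zip(received, received[1:]):
        if s2 - s1 > interval_s + tolerance_s:
            alerts.append(("gap", s1, s2))
    # If nothing has arrived recently at all, the pipeline may be down.
    if received and now - received[-1][0] > interval_s + tolerance_s:
        alerts.append(("stale", received[-1][0], now))
    return latencies, alerts
```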
Jeff, aye, you are right. You wouldn’t be able to run the sequence number
check on your saved data. Sorry, I forgot that it wasn’t just the full
webrequest_text. You’d have to run another kafkatee output pipe then, to
check unsampled sequence numbers, similar to how the packet-loss.cpp script
worked with udp2log.
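A rough sketch of the per-host gap check such a kafkatee output pipe could feed, in the spirit of packet-loss.cpp (not the actual tool; record shape is assumed):

```python
from collections import defaultdict

def sequence_report(records):
    """records: iterable of (host, seq) pairs read from an unsampled
    kafkatee output pipe. Sequence numbers are assigned per producing
    host, so each host is checked independently. Arrival order is not
    guaranteed, so we sort before checking.
    Returns {host: (seen, expected, lost)}."""
    by_host = defaultdict(list)
    for host, seq in records:
        by_host[host].append(seq)
    report = {}
    for host, seqs in by_host.items():
        seqs.sort()
        seen = len(seqs)
        # If nothing was lost, the span of sequence numbers equals
        # the number of records seen.
        expected = seqs[-1] - seqs[0] + 1
        report[host] = (seen, expected, expected - seen)
    return report
```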
On Fri, Jul 8, 2016 at 11:05 AM, Toby Negrin <tnegrin@wikimedia.org> wrote:
> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored. This might
> be a bit easier to set up.
>
> It's been effective on other systems I've worked on; also a good way to
> measure delivery latency.
>
> -Toby
>
>
> On Friday, July 8, 2016, Jeff Green <jgreen@wikimedia.org> wrote:
>
>> On Fri, 8 Jul 2016, Andrew Otto wrote:
>>
>>> Well, you won’t be able to do it exactly how we do, since we are loading
>>> the data into Hadoop and then checking it there, so we use Hadoop tools.
>>> Here’s what we got:
>>>
>>>
>>>
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webreques…
>>>
>>>
>>>
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webreques…
>>>
>>> This old udp2log tool did a similar thing, so it is worth knowing about:
>>>
>>> https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-lo…
>>> However, it only worked with TSV udp2logs, and I think it won’t work with a
>>> multi-partition kafka topic, since seqs could be out of order based on
>>> partition read order.
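One way around the partition-ordering problem is to sidestep per-host sequence numbers entirely and check the consumer's own offsets per partition, since within a single partition Kafka assigns offsets contiguously on a plain (uncompacted, non-transactional) topic. A sketch, with assumed record shape:

```python
def offset_gaps(consumed):
    """consumed: iterable of (partition, offset) pairs in the order the
    consumer saw them. Cross-partition order doesn't matter: each
    partition is tracked independently, so interleaving is harmless.
    A jump between consecutive offsets on the same partition means the
    consumer skipped messages (or a seek happened)."""
    last = {}
    gaps = []
    for partition, offset in consumed:
        prev = last.get(partition)
        if prev is not None and offset != prev + 1:
            gaps.append((partition, prev, offset))
        last[partition] = offset
    return gaps
```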
>>>
>>> You guys do some kind of 15 (10?) minute roll ups, right? You could
>>> probably do some very rough guesses on data loss in each 15 minute bucket.
>>> You’d have to be careful though, since the order of the data is not
>>> guaranteed. We have the luxury of being able to query over our hourly
>>> buckets and assuming that all (most, really) of the data belongs in that
>>> hour bucket. But, we use Camus to read from Kafka, which handles the time
>>> bucket sorting for us.
>>>
>>
>> Yep, the pipeline is kafkatee->udp2log->files rotated on a 15 min
>> interval, and parser-script->mysql which runs on a separate system.
>>
>> Since the log files are stored, one option would be to have a script that
>> merges several files into a longer-period sample, then sorts and checks
>> for sequence gaps. Another option would be to modify the parse-to-mysql
>> script to do the same thing.
>>
>> But the part I don't get yet is how a script looking at output logs would
>> identify a problematic gap in sequence numbers. We have two collectors, one
>> is 1:1 and the other sampled 1:10, and both filter on the GET string. So if
>> my understanding of the sequence numbers is correct (they're per-proxy
>> right?) we should see only a small sample of sequence numbers, and how that
>> sample relates to overall traffic will vary greatly depending on
>> fundraising campaign and what else is going on on the site.
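Jeff's sampling concern can still be worked with statistically: on a 1:N sampled stream, consecutive sampled sequence numbers from the same proxy should be spaced about N apart on average, so over a large enough window the shortfall in sampled records gives a rough loss estimate. A sketch under that assumption (per-host input, hypothetical function name):

```python
def sampled_loss_estimate(seqs, sample_rate=10):
    """seqs: sequence numbers from ONE host as seen by a 1:N sampled
    collector. Over the observed span we'd expect roughly span/N
    sampled records; seeing meaningfully fewer suggests upstream loss.
    Very rough -- only meaningful over large windows, and the GET-string
    filtering on the collectors would bias it further."""
    seqs = sorted(seqs)
    span = seqs[-1] - seqs[0]
    if span == 0:
        return 0.0
    expected = span / sample_rate   # sampled records expected in span
    seen = len(seqs) - 1            # intervals actually observed
    return max(0.0, 1.0 - seen / expected)
```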
>>
>> jg
>>
>>
>>> Happy to chat more here or IRC. :)
>>>
>>> On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <jgreen@wikimedia.org> wrote:
>>> Hi Nuria, thanks for raising the issue. Could you point me to the
>>> script you're using for sequence checks? I'm definitely
>>> interested in looking at how we might integrate that into
>>> fundraising monitoring.
>>>
>>> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>>>
>>> (cc-ing analytics public list)
>>> Fundraising folks:
>>>
>>> We were talking about the problems we have had with
>>> clickstream data and kafka as of late and how to prevent
>>> issues like this one going forward:
>>> (https://phabricator.wikimedia.org/T132500)
>>>
>>> We think you guys could benefit from setting up the same set
>>> of alarms on data integrity that we have on the
>>> webrequest end and we will be happy
>>> to help with that at your convenience.
>>>
>>> An example of how these alarms could work (simplified version):
>>> every message that comes from kafka has a sequence id; if sorted,
>>> those sequence ids should be more or less contiguous, and a gap in
>>> sequence ids indicates data loss at the kafka source. A script
>>> checks the sequence-id range against the number of records and
>>> triggers an alarm if those two do not match.
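The check Nuria describes reduces to a few lines; a sketch (threshold value is a made-up example, not a production setting):

```python
def integrity_alarm(seqs, threshold=0.02):
    """Sorted sequence ids should be roughly contiguous, so compare
    the id range against the record count and alarm when the missing
    fraction exceeds the threshold."""
    seqs = sorted(seqs)
    expected = seqs[-1] - seqs[0] + 1   # ids the range should contain
    loss = 1.0 - len(seqs) / expected   # fraction of ids never seen
    return loss, loss > threshold
```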
>>>
>>> Let us know if you want to proceed with this work.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics