Well, you won’t be able to do it exactly the way we do, since we load the data into Hadoop and then check it there using Hadoop tools.  Here’s what we’ve got:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql
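
In case HiveQL isn’t convenient to read, here’s the gist of the check those queries run, as a rough Python sketch (illustrative only, not the real schema or the real scripts): for each host, compare how many records actually arrived against how many the sequence numbers say should have arrived.

    # Rough sketch of the per-host sequence check; the hostname/sequence
    # field names are illustrative, not the real schema.
    from collections import defaultdict

    def sequence_statistics(records):
        """records: iterable of (hostname, sequence) pairs for one hour of data."""
        by_host = defaultdict(list)
        for hostname, sequence in records:
            by_host[hostname].append(sequence)

        stats = {}
        for hostname, seqs in by_host.items():
            actual = len(seqs)
            expected = max(seqs) - min(seqs) + 1  # if sequences were contiguous
            stats[hostname] = {
                'count_actual': actual,
                'count_expected': expected,
                'count_missing': expected - actual,  # > 0: holes, < 0: duplicates
                'percent_loss': 100.0 * (expected - actual) / expected,
            }
        return stats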

This old udp2log tool did a similar thing, so it’s worth knowing about: https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp
However, it only worked with TSV udp2log streams, and I don’t think it will work with a multi-partition Kafka topic, since sequence numbers can arrive out of order depending on the order the partitions are read.

You guys do some kind of 15 (10?) minute roll-ups, right?  You could probably make some very rough guesses at data loss in each 15-minute bucket.  You’d have to be careful, though, since the order of the data is not guaranteed.  We have the luxury of being able to query over our hourly buckets and assume that all (most, really) of the data belongs in that hour’s bucket.  But we use Camus to read from Kafka, which handles sorting into time buckets for us.
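
A very rough version of that idea for your roll-ups might look like this (just a sketch; the 15-minute window and the assumption that each message carries a timestamp plus a per-host sequence number are mine):

    # Very rough per-bucket loss guess. Messages can arrive out of order, so a
    # sequence near a bucket boundary may land in the "wrong" bucket; treat the
    # numbers as estimates, not exact counts.
    from collections import defaultdict

    BUCKET_SECONDS = 15 * 60  # assumed 15-minute roll-up window

    def rough_loss_per_bucket(messages):
        """messages: iterable of (unix_timestamp, hostname, sequence) tuples."""
        seqs = defaultdict(list)
        for ts, hostname, sequence in messages:
            bucket_start = int(ts) // BUCKET_SECONDS * BUCKET_SECONDS
            seqs[(bucket_start, hostname)].append(sequence)

        # > 0 suggests missing records for that host in that bucket.
        return {key: (max(s) - min(s) + 1) - len(s) for key, s in seqs.items()}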

Happy to chat more here or on IRC. :)

On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <jgreen@wikimedia.org> wrote:
Hi Nuria, thanks for raising the issue. Could you point me to the script you're using for sequence checks? I'm definitely interested in looking at how we might integrate that into fundraising monitoring.


On Thu, 7 Jul 2016, Nuria Ruiz wrote:

(cc-ing analytics public list)
Fundraising folks:

We were talking about the problems we have had lately with clickstream data and Kafka, and how to prevent issues like this one going forward:
(https://phabricator.wikimedia.org/T132500)

We think you guys could benefit from setting up the same set of alarms on data integrity that we have on the webrequest end, and we will be happy
to help with that at your convenience.

An example of how these alarms could work (simplified version): every message that comes from Kafka has a sequence id. When sorted, those sequence
ids should be more or less contiguous; a gap in sequence ids indicates data loss at the Kafka source. A script compares the sequence
ids against the number of records and triggers an alarm if the two do not match.
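
A tiny illustration of that check (the names and the threshold here are made up for the example, not what we actually run):

    # Alarm if the record count differs too much from what the sequence id
    # range implies (illustrative threshold).
    def should_alarm(sequence_ids, max_loss_percent=1.0):
        expected = max(sequence_ids) - min(sequence_ids) + 1
        actual = len(sequence_ids)
        loss_percent = 100.0 * abs(expected - actual) / expected
        return loss_percent > max_loss_percent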

Let us know if you want to proceed with this work.

Thanks,

Nuria


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics