Well, we consume our Kafka streams into HDFS and check the sequence numbers with Hive through Oozie. The jobs and scripts are here:

https://github.com/wikimedia/analytics-refinery/tree/master/oozie/webrequest/load

So it's a bit more complicated and not directly useful to your data flow (Kafkatee -> MySQL, right?). But we'd love to help you get familiar with the code and approach. This script computes the stats and puts them in wmf.webrequest_sequence_stats:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql

This is then aggregated hourly, and checked by this workflow, which sends emails if it sees problems:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/check_sequence_statistics_workflow.xml

We can then use the data quality information for each hour to re-run jobs, postpone jobs that would compute bad data, and so on. We do some of that, but we've changed it a bit over the years, so if you'd like more detail you can grab someone like Joseph for a quick meeting.

On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <jgreen@wikimedia.org> wrote:

Hi Nuria, thanks for raising the issue. Could you point me to the script you're using for sequence checks? I'm definitely interested in looking at how we might integrate that into fundraising monitoring.

On Thu, 7 Jul 2016, Nuria Ruiz wrote:

(cc-ing analytics public list)
Fundraising folks:

We were talking about the problems we have had lately with clickstream data and Kafka, and how to prevent issues like this one going forward:
(https://phabricator.wikimedia.org/T132500)

We think you guys could benefit from setting up the same set of data integrity alarms that we have on the webrequest end, and we will be happy
to help with that at your convenience.

An example of how these alarms could work (simplified version): every message that comes from Kafka has a sequence ID. When sorted, those sequence
IDs should be more or less contiguous; a gap in sequence IDs indicates data loss at the Kafka source. A script compares the range of sequence
IDs with the number of records and triggers an alarm if the two do not match.
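
A minimal sketch of that check in Python, to make the idea concrete. This is not the refinery code (which does the work in Hive). It assumes each record can be reduced to a (source, sequence ID) pair and that sequence IDs increase by one per source; the names sequence_stats and sources_with_problems are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class SequenceStats(NamedTuple):
    min_seq: int   # smallest sequence ID seen for this source
    max_seq: int   # largest sequence ID seen for this source
    actual: int    # number of records actually received

    @property
    def expected(self) -> int:
        # With contiguous IDs, the ID range tells us how many records to expect.
        return self.max_seq - self.min_seq + 1

    @property
    def missing(self) -> int:
        # Positive: records were lost. Negative: duplicates were received.
        return self.expected - self.actual


def sequence_stats(records: Iterable[Tuple[str, int]]) -> Dict[str, SequenceStats]:
    """Group (source, sequence ID) pairs by source and summarise each group.

    Assumption: sequence IDs are assigned per source, so each source is
    checked independently.
    """
    seqs = defaultdict(list)
    for source, seq in records:
        seqs[source].append(seq)
    return {src: SequenceStats(min(s), max(s), len(s)) for src, s in seqs.items()}


def sources_with_problems(stats: Dict[str, SequenceStats]) -> List[str]:
    """Return the sources whose record count does not match their ID range."""
    return sorted(src for src, s in stats.items() if s.expected != s.actual)


# Hypothetical sample: source "a" is missing sequence ID 3, source "b" is complete.
sample = [("a", 1), ("a", 2), ("a", 4), ("b", 10), ("b", 11)]
stats = sequence_stats(sample)
problems = sources_with_problems(stats)
if problems:
    print("sequence check failed for:", ", ".join(problems))
```

In practice you would run something like this per time window (we do it hourly) and send an email or raise an alarm instead of printing.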

Let us know if you want to proceed with this work.

Thanks,

Nuria
