Well, you won’t be able to do it exactly how we do, since we load the data into Hadoop and check it there with Hadoop tools. Here’s what we’ve got:
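(A minimal sketch of the idea, not our actual Hadoop job; the names and in-memory layout are illustrative. It assumes each udp2log host stamps a monotonically increasing sequence number on every line, so per-host loss in a bucket is just (max − min + 1) − count.)

```python
from collections import defaultdict

def estimate_loss_per_host(rows):
    """Estimate dropped messages per host within one time bucket.

    rows: iterable of (hostname, sequence) pairs, where sequence is the
    per-host monotonically increasing udp2log sequence number.
    Loss = (max(seq) - min(seq) + 1) - count, i.e. how many sequence
    numbers we expected to see in the bucket but didn't.
    """
    seen = defaultdict(lambda: [None, None, 0])  # host -> [min, max, count]
    for host, seq in rows:
        lo, hi, n = seen[host]
        seen[host] = [seq if lo is None else min(lo, seq),
                      seq if hi is None else max(hi, seq),
                      n + 1]
    return {host: (hi - lo + 1) - n for host, (lo, hi, n) in seen.items()}
```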
However, that only worked with TSV udp2log streams, and I don’t think it will work with a multi-partition Kafka topic, since sequence numbers can arrive out of order depending on partition read order.
You guys do some kind of 15 (10?) minute roll-ups, right? You could probably make some very rough guesses about data loss in each 15-minute bucket, something like the sketch below. You’d have to be careful, though, since the order of the data isn’t guaranteed. We have the luxury of being able to query over our hourly buckets and assume that all (most, really) of the data belongs in that hour’s bucket. But we use Camus to read from Kafka, and it handles the time-bucket sorting for us.
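Here’s the kind of rough per-bucket estimate I mean (hypothetical field names; it assumes each record carries a timestamp plus the per-host sequence number). Since it only tracks min, max, and count, partition read order doesn’t matter; the estimate is only skewed by records that land in the wrong bucket near a boundary.

```python
BUCKET_SECONDS = 15 * 60  # hypothetical 15-minute roll-up window

def bucket_of(ts_epoch):
    """Map an epoch timestamp to the start of its 15-minute bucket."""
    return ts_epoch - (ts_epoch % BUCKET_SECONDS)

def rough_loss_by_bucket(records):
    """records: iterable of (epoch_ts, hostname, sequence).

    Groups by (bucket, host) and applies the max-min+1 minus count
    estimate. min/max/count are order-insensitive, so out-of-order
    reads across partitions don't break it; only records that straddle
    a bucket boundary inflate or deflate the estimate.
    """
    stats = {}
    for ts, host, seq in records:
        key = (bucket_of(ts), host)
        lo, hi, n = stats.get(key, (seq, seq, 0))
        stats[key] = (min(lo, seq), max(hi, seq), n + 1)
    return {key: (hi - lo + 1) - n for key, (lo, hi, n) in stats.items()}
```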
Happy to chat more here or on IRC. :)