> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored.

Aye, Jeff mentioned maybe doing that. Not a bad idea.
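A minimal sketch of what that canary check could look like, assuming canary messages carry their send timestamp and are produced on a fixed interval (function name, interval, and tolerance are all hypothetical, not an existing script):

```python
import time

def check_canaries(received, interval_s=60, tolerance_s=30, now=None):
    """received: sorted list of (send_ts, recv_ts) pairs for canary
    messages produced every interval_s seconds. Returns per-message
    delivery latencies and a list of alerts."""
    now = now if now is not None else time.time()
    # Delivery latency per canary -- the "measure delivery latency"
    # side benefit Toby mentions.
    latencies = [recv - sent for sent, recv in received]
    alerts = []
    # A hole between consecutive canaries wider than the send interval
    # plus tolerance means at least one canary never arrived.
    for (s1, _), (s2, _) in zip(received, received[1:]):
        if s2 - s1 > interval_s + tolerance_s:
            alerts.append(("gap", s1, s2))
    # If nothing has arrived recently at all, the pipeline may be down.
    if received and now - received[-1][0] > interval_s + tolerance_s:
        alerts.append(("stale", received[-1][0], now))
    return latencies, alerts
```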
Jeff, aye, you are right. You wouldn’t be able to run the sequence number
check on your saved data. Sorry, I forgot that it wasn’t just the full
webrequest_text. You’d have to run another kafkatee output pipe then, to
check unsampled sequence numbers, similar to how the packet-loss.cpp script
worked with udp2log.
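A rough sketch of the per-host gap check such a kafkatee output pipe could feed, in the spirit of packet-loss.cpp (not the actual tool; record shape is assumed):

```python
from collections import defaultdict

def sequence_report(records):
    """records: iterable of (host, seq) pairs read from an unsampled
    kafkatee output pipe. Sequence numbers are assigned per producing
    host, so each host is checked independently. Arrival order is not
    guaranteed, so we sort before checking.
    Returns {host: (seen, expected, lost)}."""
    by_host = defaultdict(list)
    for host, seq in records:
        by_host[host].append(seq)
    report = {}
    for host, seqs in by_host.items():
        seqs.sort()
        seen = len(seqs)
        # If nothing was lost, the span of sequence numbers equals
        # the number of records seen.
        expected = seqs[-1] - seqs[0] + 1
        report[host] = (seen, expected, expected - seen)
    return report
```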
On Fri, Jul 8, 2016 at 11:05 AM, Toby Negrin <tnegrin@wikimedia.org> wrote:
> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored. This might
> be a bit easier to set up.
>
> It's been effective on other systems I've worked on; also a good way to
> measure delivery latency.
>
> -Toby
>
>
> On Friday, July 8, 2016, Jeff Green <jgreen@wikimedia.org> wrote:
>
>> On Fri, 8 Jul 2016, Andrew Otto wrote:
>>
>>> Well, you won’t be able to do it exactly how we do, since we are loading
>>> the data into Hadoop and then checking it there, so we use Hadoop tools.
>>> Here’s what we got:
>>>
>>>
>>>
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webreques…
>>>
>>>
>>>
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webreques…
>>>
>>> This old udp2log tool did a similar thing, so it is worth knowing about:
>>>
>>> https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-lo…
>>> However, it only worked with TSV udp2logs, and I think it won’t work with a
>>> multi-partition kafka topic, since seqs could be out of order based on
>>> partition read order.
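One way around the partition-ordering problem is to sidestep per-host sequence numbers entirely and check the consumer's own offsets per partition, since within a single partition Kafka assigns offsets contiguously on a plain (uncompacted, non-transactional) topic. A sketch, with assumed record shape:

```python
def offset_gaps(consumed):
    """consumed: iterable of (partition, offset) pairs in the order the
    consumer saw them. Cross-partition order doesn't matter: each
    partition is tracked independently, so interleaving is harmless.
    A jump between consecutive offsets on the same partition means the
    consumer skipped messages (or a seek happened)."""
    last = {}
    gaps = []
    for partition, offset in consumed:
        prev = last.get(partition)
        if prev is not None and offset != prev + 1:
            gaps.append((partition, prev, offset))
        last[partition] = offset
    return gaps
```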
>>>
>>> You guys do some kind of 15 (10?) minute roll ups, right? You could
>>> probably do some very rough guesses on data loss in each 15 minute bucket.
>>> You’d have to be careful though, since the order of the data is not
>>> guaranteed. We have the luxury of being able to query over our hourly
>>> buckets and assuming that all (most, really) of the data belongs in that
>>> hour bucket. But, we use Camus to read from Kafka, which handles the time
>>> bucket sorting for us.
>>>
>>
>> Yep, the pipeline is kafkatee->udp2log->files rotated on a 15 min
>> interval, and parser-script->mysql which runs on a separate system.
>>
>> Since the log files are stored, one option would be to have a script that
>> merges several files into a longer-period sample, then sorts and checks
>> for sequence gaps. Another option would be to modify the parse-to-mysql
>> script to do the same thing.
>>
>> But the part I don't get yet is how a script looking at output logs would
>> identify a problematic gap in sequence numbers. We have two collectors, one
>> is 1:1 and the other sampled 1:10, and both filter on the GET string. So if
>> my understanding of the sequence numbers is correct (they're per-proxy
>> right?) we should see only a small sample of sequence numbers, and how that
>> sample relates to overall traffic will vary greatly depending on
>> fundraising campaign and what else is going on on the site.
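Jeff's sampling concern can still be worked with statistically: on a 1:N sampled stream, consecutive sampled sequence numbers from the same proxy should be spaced about N apart on average, so over a large enough window the shortfall in sampled records gives a rough loss estimate. A sketch under that assumption (per-host input, hypothetical function name):

```python
def sampled_loss_estimate(seqs, sample_rate=10):
    """seqs: sequence numbers from ONE host as seen by a 1:N sampled
    collector. Over the observed span we'd expect roughly span/N
    sampled records; seeing meaningfully fewer suggests upstream loss.
    Very rough -- only meaningful over large windows, and the GET-string
    filtering on the collectors would bias it further."""
    seqs = sorted(seqs)
    span = seqs[-1] - seqs[0]
    if span == 0:
        return 0.0
    expected = span / sample_rate   # sampled records expected in span
    seen = len(seqs) - 1            # intervals actually observed
    return max(0.0, 1.0 - seen / expected)
```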
>>
>> jg
>>
>>
>>> Happy to chat more here or IRC. :)
>>>
>>> On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <jgreen@wikimedia.org> wrote:
>>> Hi Nuria, thanks for raising the issue. Could you point me to the
>>> script you're using for sequence checks? I'm definitely
>>> interested in looking at how we might integrate that into
>>> fundraising monitoring.
>>>
>>> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>>>
>>> (cc-ing analytics public list)
>>> Fundraising folks:
>>>
>>> We were talking about the problems we have had with
>>> clickstream data and kafka as of late and how to prevent
>>> issues like this one going forward:
>>> (https://phabricator.wikimedia.org/T132500)
>>>
>>> We think you guys could benefit from setting up the same set
>>> of alarms on data integrity that we have on the
>>> webrequest end and we will be happy
>>> to help with that at your convenience.
>>>
>>> An example of how these alarms could work (simplified version):
>>> every message that comes from kafka has a sequence id; if sorted,
>>> those sequence ids should be more or less contiguous, and a gap in
>>> sequence ids indicates data loss at the kafka source. A script
>>> checks the sequence-id range against the number of records and
>>> triggers an alarm if those two do not match.
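The check Nuria describes reduces to a few lines; a sketch (threshold value is a made-up example, not a production setting):

```python
def integrity_alarm(seqs, threshold=0.02):
    """Sorted sequence ids should be roughly contiguous, so compare
    the id range against the record count and alarm when the missing
    fraction exceeds the threshold."""
    seqs = sorted(seqs)
    expected = seqs[-1] - seqs[0] + 1   # ids the range should contain
    loss = 1.0 - len(seqs) / expected   # fraction of ids never seen
    return loss, loss > threshold
```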
>>>
>>> Let us know if you want to proceed with this work.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics