(cc-ing analytics public list)
Fundraising folks:
We were talking about the problems we have had lately with clickstream data and Kafka, and how to prevent issues like this one going forward: https://phabricator.wikimedia.org/T132500
We think you guys could benefit from setting up the same set of data-integrity alarms that we have on the webrequest end, and we will be happy to help with that at your convenience.
An example of how these alarms could work (simplified version): every message that comes from Kafka has a sequence ID. When sorted, those sequence IDs should be more or less contiguous; a gap in sequence IDs indicates data loss at the Kafka source. A script checks the sequence IDs against the number of records and triggers an alarm if the two do not match.
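To make that concrete, here is a rough sketch of such a check in Python (the hostnames, field names and alarm handling are made up for illustration; our real checks are the Hive jobs linked later in this thread):

from collections import defaultdict

def check_sequences(records, max_loss_pct=1.0):
    # records: iterable of (hostname, sequence_id) pairs.
    # Sequence IDs are assigned per source host, so group by host first,
    # then compare how many records we saw against the range spanned by
    # the smallest and largest sequence IDs.
    seqs = defaultdict(list)
    for host, seq in records:
        seqs[host].append(seq)

    alarms = []
    for host, ids in seqs.items():
        ids.sort()
        expected = ids[-1] - ids[0] + 1   # how many we should have seen
        actual = len(set(ids))            # how many distinct ones we did see
        loss_pct = 100.0 * (expected - actual) / expected
        if loss_pct > max_loss_pct:
            alarms.append((host, expected, actual, loss_pct))
    return alarms

# Toy run: host cp1001 is missing sequence IDs 3 and 4.
sample = [("cp1001", 1), ("cp1001", 2), ("cp1001", 5),
          ("cp1002", 10), ("cp1002", 11)]
for host, expected, actual, loss in check_sequences(sample):
    print("ALARM %s: expected %d records, saw %d (%.1f%% lost)"
          % (host, expected, actual, loss))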
Let us know if you want to proceed with this work.
Thanks,
Nuria
Hi Nuria, thanks for raising the issue. Could you point me to the script you're using for sequence checks? I'm definitely interested in looking at how we might integrate that into fundraising monitoring.
On Thu, 7 Jul 2016, Nuria Ruiz wrote:
Well, we consume our Kafka streams into HDFS and check the sequence numbers with Hive via Oozie; the jobs and scripts are here:
https://github.com/wikimedia/analytics-refinery/tree/master/oozie/webrequest...
So it's a bit more complicated and not directly applicable to your data flow (kafkatee -> MySQL, right?). But we'd love to help you get familiar with the code and approach. This script computes the stats and puts them in wmf.webrequest_sequence_stats:
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest...
This is then aggregated hourly, and checked by this workflow, which sends emails if it sees problems:
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest...
We can then use information about data quality for each hour to re-run jobs, postpone jobs that would compute bad data, and so on. And we do some of that, but we've changed it a bit over the years, so if you'd like more detail you can grab someone like Joseph and have a quick meeting.
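If it helps, here is roughly the shape of what those jobs compute and check, sketched in Python rather than Hive (the column names, threshold and alert address here are simplified or made up; the .hql files above are the real definitions):

import smtplib
from collections import defaultdict
from email.message import EmailMessage

LOSS_THRESHOLD_PCT = 2.0  # illustrative threshold, not the production value

def sequence_stats(records):
    # records: iterable of (hostname, hour, sequence_id).
    # Produces one row per (hostname, hour), in the spirit of
    # wmf.webrequest_sequence_stats: actual count, expected count
    # (max - min + 1) and percent lost.
    by_key = defaultdict(list)
    for host, hour, seq in records:
        by_key[(host, hour)].append(seq)

    rows = []
    for (host, hour), seqs in by_key.items():
        expected = max(seqs) - min(seqs) + 1
        actual = len(set(seqs))
        rows.append({"hostname": host, "hour": hour,
                     "count_actual": actual, "count_expected": expected,
                     "percent_lost": 100.0 * (expected - actual) / expected})
    return rows

def alert_if_bad(rows, to_addr="analytics-alerts@example.org"):
    # Send one email listing every (host, hour) over the loss threshold.
    bad = [r for r in rows if r["percent_lost"] > LOSS_THRESHOLD_PCT]
    if not bad:
        return
    msg = EmailMessage()
    msg["Subject"] = "webrequest sequence stats: possible data loss"
    msg["To"] = to_addr
    msg.set_content("\n".join(
        "%(hostname)s %(hour)s: %(percent_lost).2f%% lost "
        "(%(count_actual)d/%(count_expected)d)" % r for r in bad))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)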
On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green jgreen@wikimedia.org wrote:
Well, you won’t be able to do it exactly how we do it, since we load the data into Hadoop and check it there with Hadoop tools. Here’s what we’ve got:
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql
This old udp2log tool did a similar thing, so it is worth knowing about: https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp. However, it only worked with TSV udp2log streams, and I think it won’t work with a multi-partition Kafka topic, since sequence numbers could be out of order depending on partition read order.
You guys do some kind of 15 (10?) minute roll-ups, right? You could probably make some very rough guesses about data loss in each 15-minute bucket. You’d have to be careful though, since the order of the data is not guaranteed. We have the luxury of being able to query over our hourly buckets and assume that all (most, really) of the data belongs in that hour’s bucket. But we use Camus to read from Kafka, which handles the time-bucket sorting for us.
Happy to chat more here or IRC. :)
On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green jgreen@wikimedia.org wrote:
On Fri, 8 Jul 2016, Andrew Otto wrote:
Yep, the pipeline is kafkatee -> udp2log -> files rotated on a 15-minute interval, and then a parser-script -> MySQL step which runs on a separate system.
Since the log files are stored, one option would be a script that merges several files into a longer-period sample, then sorts and checks for sequence gaps. Another option would be to modify the parse-to-mysql script to do the same thing.
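Something like this could be the first option, if the rotated files stick around long enough (the file naming, tab-separated format and column positions here are guesses on my part):

import glob
from collections import defaultdict

def read_records(paths, host_col=0, seq_col=1, sep="\t"):
    # Yield (hostname, sequence_id) from several rotated log files.
    for path in paths:
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split(sep)
                yield fields[host_col], int(fields[seq_col])

def find_gaps(records):
    # Return {host: [(first_missing, last_missing), ...]} over the merged window.
    seqs = defaultdict(set)
    for host, seq in records:
        seqs[host].add(seq)
    gaps = {}
    for host, ids in seqs.items():
        ordered = sorted(ids)
        host_gaps = [(a + 1, b - 1)
                     for a, b in zip(ordered, ordered[1:]) if b - a > 1]
        if host_gaps:
            gaps[host] = host_gaps
    return gaps

# e.g. merge an hour's worth of 15-minute files (hypothetical naming scheme)
files = sorted(glob.glob("/var/log/fr/landingpages-20160708-09*.tsv"))
for host, host_gaps in sorted(find_gaps(read_records(files)).items()):
    print(host, host_gaps)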
But the part I don't get yet is how a script looking at output logs would identify a problematic gap in sequence numbers. We have two collectors: one is 1:1 and the other is sampled 1:10, and both filter on the GET string. So if my understanding of the sequence numbers is correct (they're per-proxy, right?), we should see only a small sample of sequence numbers, and how that sample relates to overall traffic will vary greatly depending on the fundraising campaign and whatever else is going on on the site.
jg
Another approach we discussed back in the day was setting up a canary script to send known good messages whose delivery is monitored. This might be a bit easier to set up.
It's been effective on other systems I've worked on; also a good way to measure delivery latency.
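A rough sketch of both halves, just to show the shape of it (the topic name, marker string, payload format and the kafka-python client are all assumptions, not a finished design):

import json
import time
import uuid

from kafka import KafkaProducer  # kafka-python; any producer client would do

CANARY_MARKER = "fr-canary"  # made-up marker string

def send_canary(topic="test.fundraising.canary", bootstrap="localhost:9092"):
    # Produce one known-good message carrying a unique id and a send timestamp.
    payload = {"marker": CANARY_MARKER,
               "id": str(uuid.uuid4()),
               "sent_at": time.time()}
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    producer.send(topic, json.dumps(payload).encode("utf-8"))
    producer.flush()
    return payload["id"]

def check_canaries(logfile, max_age_seconds=900):
    # Scan the consumer-side output for canary records and alarm on staleness.
    # Assumes the consumer writes the canary payload through as one JSON line
    # per record; if it also stamps a "seen_at" field we get true delivery
    # latency instead of just freshness.
    newest_sent = None
    with open(logfile) as f:
        for line in f:
            if CANARY_MARKER not in line:
                continue
            record = json.loads(line)
            newest_sent = max(newest_sent or 0.0, record["sent_at"])
            if "seen_at" in record:
                print("canary %s: latency %.1fs"
                      % (record["id"], record["seen_at"] - record["sent_at"]))
    if newest_sent is None or time.time() - newest_sent > max_age_seconds:
        print("ALARM: no recent canary made it through the pipeline")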
-Toby
On Friday, July 8, 2016, Jeff Green jgreen@wikimedia.org wrote:
> Another approach we discussed back in the day was setting up a canary script to send known good messages whose delivery is monitored.
Aye, Jeff mentioned maybe doing that. Not a bad idea.
Jeff, aye, you are right. You wouldn’t be able to run the sequence number check on your saved data. Sorry, I forgot that it wasn’t just the full webrequest_text. You’d have to run another kafkatee output pipe then, to check unsampled sequence numbers, similar to how the packet-loss.cpp script worked with udp2log.
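In that spirit, here is a small Python stand-in for packet-loss.cpp that could hang off the end of such a pipe (the column positions, separator and reporting interval are guesses; adjust for the real kafkatee output format):

#!/usr/bin/env python
# Read "hostname<TAB>sequence_id" lines from stdin (e.g. a kafkatee output
# pipe) and periodically report a rough per-host loss estimate, in the style
# of the old packet-loss.cpp udp2log monitor.
import sys
import time
from collections import defaultdict

REPORT_EVERY = 60  # seconds between reports; illustrative only

def main(host_col=0, seq_col=1, sep="\t"):
    lo, hi, seen = {}, {}, defaultdict(int)
    last_report = time.time()
    for line in sys.stdin:
        fields = line.rstrip("\n").split(sep)
        try:
            host, seq = fields[host_col], int(fields[seq_col])
        except (IndexError, ValueError):
            continue  # skip malformed lines
        lo[host] = min(lo.get(host, seq), seq)
        hi[host] = max(hi.get(host, seq), seq)
        seen[host] += 1

        if time.time() - last_report >= REPORT_EVERY:
            for h in sorted(seen):
                expected = hi[h] - lo[h] + 1
                lost_pct = 100.0 * (expected - seen[h]) / expected
                sys.stderr.write("%s: ~%.2f%% lost over last interval\n"
                                 % (h, lost_pct))
            lo.clear()
            hi.clear()
            seen.clear()
            last_report = time.time()

if __name__ == "__main__":
    main()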
On Fri, Jul 8, 2016 at 11:05 AM, Toby Negrin tnegrin@wikimedia.org wrote:
It sounds like a canary/heartbeat approach is the best fit for the fundraising scenario; we'll put that in the hopper. Thanks for all your feedback, everyone!
jg
On Fri, 8 Jul 2016, Andrew Otto wrote: