As previously mentioned, I've been digging into QA testing of the new pageview definitions, and noticed a weird spike.[0] This was narrowed down to 27 January, thence to 22:00-23:00 UTC,[1] and a breakpoint was then seen at approximately 22:35 UTC.[2]
TL;DR: either I don't understand how sequence number/hostname combinations work, or there's massive duplication (and sometimes triplication) happening in the webrequest table.
I grabbed a 5-minute slice of pageviews around the 22:35 breakpoint, coming to 6 million rows in total. My first hypothesis was that we were looking at some form of external attack (automata, say?), but the requests were evenly distributed between the desktop and mobile sites,[3] were not linked to any particular user agent or class of user agents,[4] and were not linked to any particular IP address.[5]
With that hypothesis looking shaky, I instead investigated internal snafus. The most obvious candidate was duplicate events. As I understand it (and I really hope I'm wrong about this), each hostname issues a unique-to-the-host sequence number with each request, incrementing each time. Accordingly, in a universe where we have no duplicate events, the {hostname, sequence_number} tuples in a dataset of requests should contain zero duplicates.
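For concreteness, the sort of check I mean looks something like the following Hive sketch (the column names hostname and sequence, and the year/month/day/hour partition keys, are my reading of the webrequest schema; this is an illustration, not the exact query I ran):

    -- Flag every {hostname, sequence} tuple that occurs more than once
    -- in a single hour of raw webrequest data.
    SELECT hostname, sequence, COUNT(*) AS occurrences
    FROM wmf_raw.webrequest
    WHERE year = 2015 AND month = 1 AND day = 27 AND hour = 22
    GROUP BY hostname, sequence
    HAVING COUNT(*) > 1;

If the tuples are truly unique, this returns zero rows.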
I dug into this and looked at how many duplicate tuples we had. And...bingo. We have many, /many/ duplicate tuples, and the point at which the duplicate count drops lines up with the point at which the pageview count drops.[6] Moreover, the number of duplicates is not proportional to the number of pageviews.[7] So it looks like what we're dealing with here is a tremendous rise in duplicate events in the webrequest table. After the duplicate requests were removed, we ended up with a more natural pattern.[8] IOW, a chaotic pattern matching the chaotic pattern we see in the number of distinct IPs.
Thoughts:

1. I thought we had systems in place to stop this? We should be calculating a per-host arithmetic series over the sequence numbers when data is loaded (see the sketch below).
2. Please tell me that my understanding of how unique sequence numbers are is terribly, terribly wrong, because the alternative is...trouble.
3. I'm not sure what this means for our "actual" pageviews, given that, as [7] shows, we still have a lot of duplicates after the artificial spike ends.
4. How many issues do I have to ID before people take me up on my request to be exclusively referred to as 'Count Logula'? ;)
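On point 1, the arithmetic-series check I have in mind is roughly this (same column-name assumptions as the sketch above): if each host's sequence numbers really do increment by one per request, then per host the row count for an hour should equal MAX(sequence) - MIN(sequence) + 1, and any surplus is duplication:

    -- Per host: expected row count from the sequence range vs. the
    -- actual row count; a positive surplus suggests duplicates,
    -- a negative one suggests lost events.
    SELECT
      hostname,
      MAX(sequence) - MIN(sequence) + 1              AS expected_rows,
      COUNT(*)                                       AS actual_rows,
      COUNT(*) - (MAX(sequence) - MIN(sequence) + 1) AS surplus
    FROM wmf_raw.webrequest
    WHERE year = 2015 AND month = 1 AND day = 27 AND hour = 22
    GROUP BY hostname;

(Caveat: a counter reset mid-hour would throw the expected figure off, so this is a smoke test rather than an exact measure.)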
[0] https://upload.wikimedia.org/wikipedia/commons/4/40/First_pageview_QA_test.p...
[1] https://upload.wikimedia.org/wikipedia/commons/0/02/First_Pageview_QA_test_-...
[2] https://upload.wikimedia.org/wikipedia/commons/a/a4/First_Pageview_QA_test_-...
[3] https://upload.wikimedia.org/wikipedia/commons/e/ec/27_2200_analysis_per_sou...
[4] https://upload.wikimedia.org/wikipedia/commons/d/dd/27_2200_analysis_per_age...
[5] https://upload.wikimedia.org/wikipedia/commons/5/51/27_2200_analysis_distinc...
[6] https://upload.wikimedia.org/wikipedia/commons/4/48/27_2200_analysis_duplica...
[7] https://upload.wikimedia.org/wikipedia/commons/7/7d/27_2200_analysis_duplica...
[8] https://upload.wikimedia.org/wikipedia/commons/a/a6/27_2200_analysis_de_dupl...
And, an additional point: I don't understand why, if dupes are the problem, the Hive query was not hit as badly by this as the equivalent UDF.
Hi Oliver,
On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
And, an additional point: I don't understand why, if dupes are the problem, the Hive query was not hit as badly by this as the equivalent UDF.
just shooting in the dark, since you did not provide your query, but if you had accidentally been querying the
wmf_raw.webrequest
(database name ending in “_raw”) table instead of
wmf.webrequest
(no “_raw” in the database name), the difference you described would be plausible (and, given the patching of GHOST, it would even be expected).
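(If you want to check which of the two a past result came from, the same count against both tables for the affected hour should differ noticeably; a sketch, assuming the standard year/month/day/hour partition keys:

    SELECT COUNT(*) FROM wmf_raw.webrequest
      WHERE year = 2015 AND month = 1 AND day = 27 AND hour = 22;

    SELECT COUNT(*) FROM wmf.webrequest
      WHERE year = 2015 AND month = 1 AND day = 27 AND hour = 22;

The raw table keeps whatever duplication the pipeline delivered; the refined one should not.)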
Have fun, Christian
We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He manually cleaned up duplicates a few times, but we know it's a problem that needs solving.
Duplicates are already cleaned up in the refined table. There should never be any duplicates in the wmf.webrequest table.

https://gerrit.wikimedia.org/r/#/c/177522/

Seeing as this was merged on Jan 26, it is possible that it was not yet deployed on Jan 27, when Oliver noticed the duplicates.
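The general shape of the deduplication, for anyone curious, is to keep one row per {hostname, sequence} pair within an hour. A sketch only, not the exact logic of the change above (the dt timestamp column and the other column names are assumptions on my part):

    -- Keep the first-seen row for each {hostname, sequence} pair.
    SELECT hostname, sequence, dt, uri_host, uri_path
    FROM (
      SELECT hostname, sequence, dt, uri_host, uri_path,
             ROW_NUMBER() OVER (PARTITION BY hostname, sequence
                                ORDER BY dt) AS rn
      FROM wmf_raw.webrequest
      WHERE year = 2015 AND month = 1 AND day = 27 AND hour = 22
    ) numbered
    WHERE rn = 1;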
We should be calculating a per-host arithmetic series over the sequence numbers when data is loaded.
Please see the wmf_raw.webrequest_sequence_stats tables for hourly partition statistics, including duplicates and losses.
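A sketch of the kind of lookup I mean (count_duplicate matches the stats output Christian posts elsewhere in this thread; the webrequest_source partition key is an assumption):

    -- Hosts with any duplicates in the given hourly partition.
    SELECT hostname, count_duplicate
    FROM wmf_raw.webrequest_sequence_stats
    WHERE webrequest_source = 'text'
      AND year = 2015 AND month = 1 AND day = 27 AND hour = 22
      AND count_duplicate > 0;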
-Ao
Aha, so wmf_raw.webrequest is expected to have duplicates? Okay! That could do it :). I'll re-run across wmf.webrequest; thanks Christian for the spot, and Andrew for having thought 3 stages ahead as usual :D
Sorry - I'm a dope :)
Nope, just dope, not /a/ dope!
Hi Andrew,
On Mon, Feb 23, 2015 at 09:35:48AM -0500, Andrew Otto wrote:
https://gerrit.wikimedia.org/r/#/c/177522/

Seeing as this was merged on Jan 26, it is possible that it was not yet deployed on Jan 27, when Oliver noticed the duplicates.
That should not be the case.
Back when you decided that deduplication should happen during refining from wmf_raw.webrequest to wmf.webrequest, and the above change was implemented, all of 2015 was deduped and backfilled in wmf.webrequest.
So all of 2015 in wmf.webrequest is deduped (with the known limitations).
Have fun, Christian
P.S.: And all the wmf.webrequest-based jobs from
https://commons.wikimedia.org/w/index.php?title=File:Refinery-oozie-overview...
that exist for 2015 got re-run on this deduped data too.
So no dupes for the corresponding legacy tsvs, pagecounts-all-sites, ...
...and no dupes in the refined pagecounts_all_sites table, which would explain the delta. Aha :D. Let's see what we see!
Hi Oliver,
On Sun, Feb 22, 2015 at 06:22:25PM -0500, Oliver Keyes wrote:
This was narrowed down to 27 January, thence to 22:00-23:00 UTC,[1] and a breakpoint was then seen at approximately 22:35 UTC.[2]
Yay for isolating the time.
[ We see duplicates! ]
How and where did you detect the duplicates? How can one reproduce your observation?
As I understand it (and I really hope I'm wrong about this), each hostname issues a unique-to-the-host sequence number with each request, incrementing each time.
The sequence numbers are not unique per host.
For varnishes, the sequence numbers should be unique per tuple: (host, service_start_time, sequence_number_overflow_count, logging_backend).
For nginx (back then still available on udp2log, IIRC), the numbers are not unique at all.
Have fun, Christian
P.S.: I had a look at the relevant hour in the wmf.webrequest table, and I could not detect duplicates [*].
P.P.S.: On 2015-01-27, GHOST [**] got patched on WMF servers. That accounted for quite some restarts and config changes. So expect many sequence number resets during that day.
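(A cheap way to spot such resets, under the same column-name assumptions as earlier in this thread: per host and hour, look at the sequence range and watch for the maximum falling back toward zero between consecutive hours:

    SELECT hostname, hour,
           MIN(sequence) AS min_seq,
           MAX(sequence) AS max_seq
    FROM wmf_raw.webrequest
    WHERE year = 2015 AND month = 1 AND day = 27
    GROUP BY hostname, hour
    ORDER BY hostname, hour;

A host whose max_seq drops sharply from one hour to the next almost certainly restarted in between.)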
[*] (See the one_hour_stats_all_sources.hql attachment.)

    qchris@stat1002 // jobs: 0 // time: 10:54:08 // exit code: 0
    cwd: ~
    hive -f one_hour_stats_all_sources.hql -d table=wmf.webrequest -d year=2015 -d month=1 -d day=27 -d hour=22
    [...]
    all_hosts.hostname           [...]  all_hosts.count_duplicate  [...]
    cp4016.ulsfo.wmnet           [...]  0                          [...]
    amssq46.esams.wmnet          [...]  0                          [...]
    cp3005.esams.wikimedia.org   [...]  0                          [...]
    amssq59.esams.wikimedia.org  [...]  0                          [...]
[**] https://access.redhat.com/articles/1332213