An additional point: I don't understand why, if duplicates are the problem, the Hive query was not hit as badly by this as the equivalent UDF.
On 22 February 2015 at 18:22, Oliver Keyes okeyes@wikimedia.org wrote:
As previously mentioned, I've been digging into QA testing the new pageviews definitions, and noticed a weird spike.[0] This was narrowed down to 27 January, and thence to 22:00-23:00 UTC on 27 January,[1] and a breakpoint was then seen at approximately 22:35 UTC.[2]
TL;DR: either I don't understand how sequence number/hostname combinations work or there's massive duplication and sometimes triplication happening in the webrequest table.
I grabbed a 5-minute slice of pageviews around the 22:35 breakpoint, coming to 6 million rows in total. My first hypothesis was that we were looking at some form of external attack (automata, say?), but the requests were evenly distributed between the desktop and mobile sites,[3] were not linked to any particular user agent or class of user agents,[4] and were not linked to any particular IP address.[5]
With that hypothesis looking tentative I instead investigated internal snafus. The most obvious was duplicate events. As I understand it (and I really hope I'm wrong about this), each hostname issues a unique-to-the-host sequence number with each request, incrementing each time. Accordingly, in a universe where we have no duplicate events, the {hostname, sequence_number} tuples in a dataset of requests should contain zero duplicates.
I dug into this and looked at how many duplicate tuples we had. And...bingo. We have many, /many/ duplicate tuples, and the point at which the duplicate count drops lines up with the point at which the number of pageviews drops.[6] Moreover, the number of duplicates is not proportionate to the number of pageviews.[7] So it looks like what we're dealing with here is a tremendous rise in duplicate events in the webrequests table. After the duplicate requests were removed, we ended up with a more natural pattern:[8] a chaotic pattern matching the chaotic pattern we see in the number of distinct IPs.
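For concreteness, here's a minimal sketch of the duplicate-tuple check described above, in Python rather than the Hive query actually used; the field names ("hostname", "sequence_number") are assumptions about the webrequest row shape:

```python
from collections import Counter

def find_duplicate_tuples(requests):
    """Count {hostname, sequence_number} tuples that appear more than
    once. In a clean dataset this should return an empty dict, since
    each host's sequence numbers are unique to that host."""
    counts = Counter(
        (r["hostname"], r["sequence_number"]) for r in requests
    )
    return {key: n for key, n in counts.items() if n > 1}

def deduplicate(requests):
    """Keep only the first request seen for each tuple, mirroring the
    de-duplication step used before re-plotting the pageview counts."""
    seen = set()
    out = []
    for r in requests:
        key = (r["hostname"], r["sequence_number"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```

The equivalent in Hive would be a GROUP BY over the two columns with a HAVING COUNT(*) > 1 filter; the Python version just makes the invariant explicit.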
Thoughts:
1. I thought we had systems in place to stop this? We should be calculating a per-host arithmetic series over the sequence numbers when data is loaded.
2. Please tell me that my understanding of how unique sequence numbers are is terribly, terribly wrong, because the alternative is...trouble.
3. I'm not sure what this means for our "actual" pageviews, given that, as [7] shows, we still have a lot of duplicates after the artificial spike ends.
4. How many issues do I have to ID before people take me up on my request to be exclusively referred to as 'Count Logula'? ;)
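The arithmetic-series check in point 1 can be sketched like this (again in Python, as an illustration of the idea rather than the actual load-time validation): since each host increments its sequence number by one per request, the sequence numbers seen from one host over a window should form a run from min to max, whose count and sum are both predictable in closed form.

```python
def sequence_check(seq_numbers):
    """Compare the sequence numbers one host emitted during a window
    against the arithmetic series implied by their min and max.
    A surplus of observations suggests duplicated events; a deficit
    suggests lost events."""
    lo, hi = min(seq_numbers), max(seq_numbers)
    expected_count = hi - lo + 1
    # Sum of the integers lo..hi: (lo + hi) * n / 2
    expected_sum = (lo + hi) * expected_count // 2
    return {
        "expected": expected_count,
        "observed": len(seq_numbers),
        "surplus": len(seq_numbers) - expected_count,
        "sum_matches": sum(seq_numbers) == expected_sum,
    }
```

Checking both the count and the sum helps distinguish a window that merely starts or ends mid-run from one that genuinely contains duplicates or holes.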
[0] https://upload.wikimedia.org/wikipedia/commons/4/40/First_pageview_QA_test.p...
[1] https://upload.wikimedia.org/wikipedia/commons/0/02/First_Pageview_QA_test_-...
[2] https://upload.wikimedia.org/wikipedia/commons/a/a4/First_Pageview_QA_test_-...
[3] https://upload.wikimedia.org/wikipedia/commons/e/ec/27_2200_analysis_per_sou...
[4] https://upload.wikimedia.org/wikipedia/commons/d/dd/27_2200_analysis_per_age...
[5] https://upload.wikimedia.org/wikipedia/commons/5/51/27_2200_analysis_distinc...
[6] https://upload.wikimedia.org/wikipedia/commons/4/48/27_2200_analysis_duplica...
[7] https://upload.wikimedia.org/wikipedia/commons/7/7d/27_2200_analysis_duplica...
[8] https://upload.wikimedia.org/wikipedia/commons/a/a6/27_2200_analysis_de_dupl...
--
Oliver Keyes
Research Analyst
Wikimedia Foundation