An additional point: I don't understand why, if duplicates are the problem, the Hive query was not hit as badly by this as the equivalent UDF.
On 22 February 2015 at 18:22, Oliver Keyes okeyes@wikimedia.org wrote:
As previously mentioned, I've been digging into QA testing the new pageviews definitions, and noticed a weird spike.[0] This was narrowed down to 27 January, and thence to 22:00-23:00 UTC on 27 January,[1] and a breakpoint was then seen at approximately 22:35 UTC.[2]
TL;DR: either I don't understand how sequence number/hostname combinations work or there's massive duplication and sometimes triplication happening in the webrequest table.
I grabbed a 5-minute slice of pageviews around the 22:35 breakpoint, coming to 6 million rows in total. My first hypothesis was that we were looking at some form of external attack (automata, say?), but the requests were evenly distributed between the desktop and mobile sites,[3] were not linked to any particular user agent or class of user agents,[4] and were not linked to any particular IP address.[5]
With that hypothesis looking tentative I instead investigated internal snafus. The most obvious was duplicate events. As I understand it (and I really hope I'm wrong about this), each hostname issues a unique-to-the-host sequence number with each request, incrementing each time. Accordingly, in a universe where we have no duplicate events, the {hostname, sequence_number} tuples in a dataset of requests should contain zero duplicates.
I dug into this and looked at how many duplicate tuples we had. And...bingo. We have many, /many/ duplicate tuples, and the point at which the duplicate count drops lines up with the point at which the number of pageviews drops.[6] Moreover, the number of duplicates is not proportionate to the number of pageviews.[7] So it looks like what we're dealing with here is a tremendous rise in duplicate events in the webrequests table. After the duplicate requests were removed, we ended up with a more natural pattern:[8] a chaotic pattern matching the chaotic pattern we see in the number of distinct IPs.
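For concreteness, here's a minimal sketch of the duplicate-tuple check described above, in Python rather than the Hive query actually used; the field names ("hostname", "sequence_number") are assumptions about the webrequest row shape:

```python
from collections import Counter

def find_duplicate_tuples(requests):
    """Count {hostname, sequence_number} tuples that appear more than
    once. In a clean dataset this should return an empty dict, since
    each host's sequence numbers are unique to that host."""
    counts = Counter(
        (r["hostname"], r["sequence_number"]) for r in requests
    )
    return {key: n for key, n in counts.items() if n > 1}

def deduplicate(requests):
    """Keep only the first request seen for each tuple, mirroring the
    de-duplication step used before re-plotting the pageview counts."""
    seen = set()
    out = []
    for r in requests:
        key = (r["hostname"], r["sequence_number"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```

The equivalent in Hive would be a GROUP BY over the two columns with a HAVING COUNT(*) > 1 filter; the Python version just makes the invariant explicit.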
Thoughts:
1. I thought we had systems in place to stop this? We should be calculating a per-host arithmetic series over the sequence numbers when data is loaded.
2. Please tell me that my understanding of how unique sequence numbers are is terribly, terribly wrong, because the alternative is...trouble.
3. I'm not sure what this means for our "actual" pageviews, given that, as [7] shows, we still have a lot of duplicates after the artificial spike ends.
4. How many issues do I have to ID before people take me up on my request to be exclusively referred to as 'Count Logula'? ;)
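The arithmetic-series check in point 1 can be sketched like this (again in Python, as an illustration of the idea rather than the actual load-time validation): since each host increments its sequence number by one per request, the sequence numbers seen from one host over a window should form a run from min to max, whose count and sum are both predictable in closed form.

```python
def sequence_check(seq_numbers):
    """Compare the sequence numbers one host emitted during a window
    against the arithmetic series implied by their min and max.
    A surplus of observations suggests duplicated events; a deficit
    suggests lost events."""
    lo, hi = min(seq_numbers), max(seq_numbers)
    expected_count = hi - lo + 1
    # Sum of the integers lo..hi: (lo + hi) * n / 2
    expected_sum = (lo + hi) * expected_count // 2
    return {
        "expected": expected_count,
        "observed": len(seq_numbers),
        "surplus": len(seq_numbers) - expected_count,
        "sum_matches": sum(seq_numbers) == expected_sum,
    }
```

Checking both the count and the sum helps distinguish a window that merely starts or ends mid-run from one that genuinely contains duplicates or holes.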
[0] https://upload.wikimedia.org/wikipedia/commons/4/40/First_pageview_QA_test.p...
[1] https://upload.wikimedia.org/wikipedia/commons/0/02/First_Pageview_QA_test_-...
[2] https://upload.wikimedia.org/wikipedia/commons/a/a4/First_Pageview_QA_test_-...
[3] https://upload.wikimedia.org/wikipedia/commons/e/ec/27_2200_analysis_per_sou...
[4] https://upload.wikimedia.org/wikipedia/commons/d/dd/27_2200_analysis_per_age...
[5] https://upload.wikimedia.org/wikipedia/commons/5/51/27_2200_analysis_distinc...
[6] https://upload.wikimedia.org/wikipedia/commons/4/48/27_2200_analysis_duplica...
[7] https://upload.wikimedia.org/wikipedia/commons/7/7d/27_2200_analysis_duplica...
[8] https://upload.wikimedia.org/wikipedia/commons/a/a6/27_2200_analysis_de_dupl...
--
Oliver Keyes
Research Analyst
Wikimedia Foundation