I confirmed this on IRC, but I'm feeding the archives here too. I'm also
convinced that the client IP hashing bug we just found explains this
problem. It's good we took a look at the other problems, but the main one
seems to be the IP hashing. We'll brain bounce more tomorrow on how to fix
that.
On Tue, Sep 15, 2015 at 6:23 PM, Oliver Keyes <okeyes(a)wikimedia.org>
wrote:
Update; I read Dan's thread about hashing, read this thread, and a
penny dropped ;).
This is totally explainable by the fact that we /expect/ to see
multiple pageIDs per IP. And we are! The hashing problem just means
those aren't /appearing/ to be the same IP.
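
To make that reasoning concrete, here is a minimal sketch. The thread does not say what the hashing bug actually is, so the mechanism below (a salt that differs from one event to the next) is purely an assumption for illustration, and `hash_ip` and `distinct_hashed_ips` are hypothetical names. It only shows how one real IP can look like several once the hashing stops being consistent.

```python
import hashlib


def hash_ip(ip, salt):
    """Hash an IP with a salt; a stand-in for whatever the real pipeline does."""
    return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()[:12]


def distinct_hashed_ips(events, salt_for_event):
    """Count distinct hashed IPs per token.

    events: list of (ip, token) pairs.
    salt_for_event: function mapping the event index to the salt used for it.
    """
    hashes_by_token = {}
    for index, (ip, token) in enumerate(events):
        hashed = hash_ip(ip, salt_for_event(index))
        hashes_by_token.setdefault(token, set()).add(hashed)
    return {token: len(hashes) for token, hashes in hashes_by_token.items()}


# Three events from one page view, all from the same (documentation-range) IP.
events = [("192.0.2.1", "tok1")] * 3

# Consistent salt: the token maps to a single hashed IP, as expected.
consistent = distinct_hashed_ips(events, lambda index: "fixed-salt")

# ASSUMED failure mode: the salt varies per event, so one IP looks like three.
inconsistent = distinct_hashed_ips(events, lambda index: f"salt-{index}")
```

With a consistent salt the token is tied to one hashed IP; with the assumed inconsistent salt the same token appears to come from three.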
On 15 September 2015 at 18:05, Erik Bernhardson
<ebernhardson(a)wikimedia.org> wrote:
We've deployed the change to bucketing, but we are still seeing the same
issue in the collected data.
Again, we are generating a unique 64-bit random number when the user gets
to the page. We are seeing this same 64-bit unique number being reported
by multiple IP addresses.
Since deploying the new schema number with the updated bucket selection we
have seen 13 distinct tokens coming from 42 distinct IP addresses. This
shouldn't be possible.
mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct clientIp) from CompletionSuggestions_13630018;
+--------------------------+
| count(distinct clientIp) |
+--------------------------+
|                       42 |
+--------------------------+
1 row in set (0.00 sec)
mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct event_pageViewToken) from CompletionSuggestions_13630018;
+-------------------------------------+
| count(distinct event_pageViewToken) |
+-------------------------------------+
|                                  13 |
+-------------------------------------+
1 row in set (0.00 sec)
My best guess at this point is that something has changed in the way
these clientIp values are collected, and that the new behavior is
incorrect.
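
The two counts above show the mismatch in aggregate; grouping by token shows which tokens are affected. The following is a minimal sketch rather than something that was run against the real store: it uses an in-memory SQLite database in place of analytics-store, and the rows in `make_demo_db` are invented. Only the table name and the `clientIp` and `event_pageViewToken` columns come from the queries above.

```python
import sqlite3

TABLE = "CompletionSuggestions_13630018"


def tokens_with_multiple_ips(conn, table=TABLE):
    """Return (token, distinct clientIp count) for tokens seen from >1 clientIp."""
    query = f"""
        SELECT event_pageViewToken, COUNT(DISTINCT clientIp) AS ip_count
        FROM {table}
        GROUP BY event_pageViewToken
        HAVING COUNT(DISTINCT clientIp) > 1
        ORDER BY ip_count DESC, event_pageViewToken
    """
    return conn.execute(query).fetchall()


def make_demo_db():
    """Build an in-memory table with invented rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} (clientIp TEXT, event_pageViewToken TEXT)"
    )
    rows = [
        ("hashA", "tok1"), ("hashB", "tok1"), ("hashC", "tok1"),  # 3 IPs
        ("hashD", "tok2"), ("hashD", "tok2"),                     # 1 IP
        ("hashE", "tok3"), ("hashF", "tok3"),                     # 2 IPs
    ]
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    for token, ip_count in tokens_with_multiple_ips(make_demo_db()):
        print(token, ip_count)
```

The same GROUP BY / HAVING query should work in the MySQL console as well; every row it returns is a token that, by construction, should only ever have one clientIp.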
On Mon, Sep 14, 2015 at 1:32 PM, Erik Bernhardson
<ebernhardson(a)wikimedia.org> wrote:
>
> Thanks for taking a look over this. I've incorporated your suggestions
> into a patch[1] and if all looks good will send that out in SWAT. We
> should be able to look at the data collected overnight and see if things
> are more sane tomorrow.
>
> [1] https://gerrit.wikimedia.org/r/#/c/238306/
>
> On Mon, Sep 14, 2015 at 11:56 AM, Gergo Tisza <gtisza(a)wikimedia.org>
> wrote:
>>
>> You are queueing a logging callback every time a request is sent (which
>> is roughly every time the user types another character in the search
>> box) until the tracking module finishes loading and
>> mw.searchSuggest.request is restored. On a slow connection the user
>> might type several characters and trigger several log events by then.
>> If you filter for queries from the same non-unique IP, you will
>> probably see something like "a", "ab", "abc"...
>>
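
The queueing behavior described above can be modeled with a small toy simulation. This is a sketch only: `SuggestLogger` and its methods are hypothetical names, it is Python rather than the actual client-side code, and the `keep_latest_only` option is one possible mitigation shown for contrast, not necessarily what the patch does.

```python
class SuggestLogger:
    """Toy model: log callbacks are queued until the tracking module loads."""

    def __init__(self, keep_latest_only=False):
        self.keep_latest_only = keep_latest_only
        self.loaded = False
        self.pending = []  # queries whose log callbacks are waiting
        self.events = []   # log events that actually fired

    def request(self, query):
        """Called once per suggestion request, roughly once per keystroke."""
        if self.loaded:
            return  # the original request function is restored; nothing queued
        if self.keep_latest_only:
            self.pending = [query]      # hypothetical fix: one pending callback
        else:
            self.pending.append(query)  # described behavior: one per request

    def tracking_loaded(self):
        """The tracking module finished loading: fire everything queued."""
        self.loaded = True
        self.events.extend(self.pending)
        self.pending = []


def simulate(keystrokes, keep_latest_only=False):
    """Type all keystrokes before the module loads and return the fired events."""
    logger = SuggestLogger(keep_latest_only=keep_latest_only)
    typed = ""
    for char in keystrokes:
        typed += char
        logger.request(typed)
    logger.tracking_loaded()
    return logger.events
```

On a slow connection, `simulate("abc")` fires one event per keystroke typed before the module loaded, while `simulate("abc", keep_latest_only=True)` fires a single event for the final query.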
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
--
Oliver Keyes
Count Logula
Wikimedia Foundation