Update: I read Dan's thread about hashing, read this thread, and the
penny dropped ;).
This is totally explainable by the fact that we /expect/ to see
multiple pageIDs per IP. And we are! The hashing problem just means
those aren't /appearing/ to be the same IP.
On 15 September 2015 at 18:05, Erik Bernhardson
<ebernhardson(a)wikimedia.org> wrote:
We've deployed the change to bucketing, but we are still seeing the same
issue in the collected data.
Again, we are generating a unique 64-bit random number when the user gets to
the page. We are seeing this same 64-bit number being reported by
multiple IP addresses.
Since deploying the new schema number with the updated bucket selection we
have seen 13 distinct tokens coming from 42 distinct ip addresses. This
shouldn't be possible.
mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct clientIp) from CompletionSuggestions_13630018;
+--------------------------+
| count(distinct clientIp) |
+--------------------------+
|                       42 |
+--------------------------+
1 row in set (0.00 sec)
mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct event_pageViewToken) from CompletionSuggestions_13630018;
+-------------------------------------+
| count(distinct event_pageViewToken) |
+-------------------------------------+
|                                  13 |
+-------------------------------------+
1 row in set (0.00 sec)
My best guess at this point is that something has changed in the way these
clientIp values are collected, and that the recorded values are incorrect.
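
The two counts above only show the totals. To see which tokens are actually shared across addresses, a per-token breakdown may help. The following is a minimal sketch, not a query anyone in this thread ran: it uses Python's sqlite3 with an in-memory table and made-up rows (the `token-*` and `ip-*` values are hypothetical), reusing only the table and column names from the queries above. The same GROUP BY / HAVING statement should work against the real MySQL table.

```python
import sqlite3

# Hypothetical sample rows (pageViewToken, clientIp); NOT real collected data.
rows = [
    ("token-a", "ip-1"),
    ("token-a", "ip-2"),
    ("token-a", "ip-2"),
    ("token-b", "ip-3"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CompletionSuggestions_13630018 "
    "(event_pageViewToken TEXT, clientIp TEXT)"
)
conn.executemany(
    "INSERT INTO CompletionSuggestions_13630018 VALUES (?, ?)", rows
)

# Tokens reported by more than one distinct clientIp, with event counts.
query = """
    SELECT event_pageViewToken,
           COUNT(DISTINCT clientIp) AS ips,
           COUNT(*) AS events
    FROM CompletionSuggestions_13630018
    GROUP BY event_pageViewToken
    HAVING COUNT(DISTINCT clientIp) > 1
    ORDER BY ips DESC
"""
suspicious = conn.execute(query).fetchall()
print(suspicious)
```

If the clientIp values for a single token turn out to be stable per event but differ between events, that would point at how the address is recorded rather than at the token generation.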
On Mon, Sep 14, 2015 at 1:32 PM, Erik Bernhardson
<ebernhardson(a)wikimedia.org> wrote:
Thanks for taking a look over this. I've incorporated your suggestions
into a patch[1] and if all looks good will send that out in SWAT. We should
be able to look at the data collected overnight and see if things are more
sane tomorrow.
[1] https://gerrit.wikimedia.org/r/#/c/238306/
On Mon, Sep 14, 2015 at 11:56 AM, Gergo Tisza <gtisza(a)wikimedia.org>
wrote:
You are queueing a logging callback every time a request is sent (which
is roughly every time the user types another character in the search box)
until the tracking module finishes loading and mw.searchSuggest.request is
restored. On a slow connection the user might type several characters and
trigger several log events by then. If you filter for queries from the same
non-unique IP, you will probably see something like "a", "ab",
"abc"...
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics