We've deployed the change to bucketing, but we are still seeing the same issue in the collected data. 

Again, we are generating a unique 64-bit random number when the user gets to the page. We are seeing the same 64-bit number being reported by multiple IP addresses.

Since deploying the new schema number with the updated bucket selection, we have seen 13 distinct tokens coming from 42 distinct IP addresses. This shouldn't be possible if each token belongs to a single page view.

mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct clientIp) from CompletionSuggestions_13630018;
+--------------------------+
| count(distinct clientIp) |
+--------------------------+
|                       42 |
+--------------------------+
1 row in set (0.00 sec)

mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct event_pageViewToken) from CompletionSuggestions_13630018;

+-------------------------------------+
| count(distinct event_pageViewToken) |
+-------------------------------------+
|                                  13 |
+-------------------------------------+
1 row in set (0.00 sec)


My best guess at this point is that something has changed in the way the clientIp values are collected, and that they are now being recorded incorrectly.


On Mon, Sep 14, 2015 at 1:32 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
Thanks for taking a look over this. I've incorporated your suggestions into a patch[1] and, if all looks good, will send that out in SWAT. We should be able to look at the data collected overnight and see if things are more sane tomorrow.

[1] https://gerrit.wikimedia.org/r/#/c/238306/

On Mon, Sep 14, 2015 at 11:56 AM, Gergo Tisza <gtisza@wikimedia.org> wrote:
You are queueing a logging callback every time a request is sent (which is roughly every time the user types another character in the search box) until the tracking module finishes loading and mw.searchSuggest.request is restored. On a slow connection the user might type several characters and trigger several log events by then. If you filter for queries from the same non-unique IP, you will probably see something like "a", "ab", "abc"...

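A rough sketch of the pattern described above, to make the duplicate events concrete. This is not the actual extension code: simulate, originalRequest, pending and the token value are hypothetical stand-ins, and the "log once" guard is only one possible way to avoid the repeats. It illustrates why one page view can produce several events; it does not by itself explain a token showing up under several IPs.

```javascript
// Hypothetical stand-ins throughout: none of these names come from the real
// extension, apart from the idea of wrapping a request function until the
// tracking module has loaded.
function simulate(logOncePerPage) {
  const events = [];                // what would be sent to the logging endpoint
  const pageViewToken = 'token-1';  // one token per page view (made-up value)
  const pending = [];               // callbacks waiting for the tracking module
  let queued = false;

  const originalRequest = (query) => query; // stand-in for the real request

  // Temporary wrapper installed while the tracking module is still loading.
  let request = (query) => {
    if (!logOncePerPage || !queued) {
      queued = true;
      pending.push(() => events.push({ pageViewToken, query }));
    }
    return originalRequest(query);
  };

  // The user types three characters before the module finishes loading.
  ['a', 'ab', 'abc'].forEach((q) => request(q));

  // Module finishes loading: flush the queue and restore the original function.
  pending.forEach((cb) => cb());
  request = originalRequest;
  request('abcd'); // later keystrokes no longer queue anything

  return events;
}

const unguarded = simulate(false);
const guarded = simulate(true);
console.log(unguarded.map((e) => e.query)); // queries logged without the guard
console.log(guarded.map((e) => e.query));   // queries logged with the guard
```

Without the guard, the three keystrokes typed during loading each produce an event carrying the same token; with the guard, only the first one does.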