Also a side note:
Math.floor( Math.random() * populationSize ) === 0
Math.random() is not a high-quality random source; it has had issues since
its creation, and we have seen many duplicate values from it, especially in
Safari. For truly random values you should use the Crypto API when
available:
https://developer.mozilla.org/en-US/docs/Web/API/Window/crypto
Now, since you already have a random user session id from MediaWiki (and
that code uses the Crypto API when available), could you use that session
id to segment your users as well? JavaScript can't represent integers that
large (64 bits) exactly, but you could use just part of the returned value
to do your probabilistic assignment to groups.
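For example, a minimal sketch of that approach (sessionIdToOneIn is a
hypothetical helper name; it assumes the session id is a 64-bit hex string,
as mw.user.generateRandomSessionId() produces):

```javascript
// Sketch: derive a bucketing decision from part of an existing session id,
// rather than calling Math.random() again.
// Assumption: sessionId is a hex string of at least 8 characters.
function sessionIdToOneIn( sessionId, population ) {
    // Use only the first 8 hex chars (32 bits), which fit safely in a
    // JavaScript number; parsing the full 64 bits would lose precision.
    var part = parseInt( sessionId.slice( 0, 8 ), 16 );
    // Note: modulo is very slightly biased when population doesn't divide
    // 2^32 evenly, but the bias is negligible at 1-in-10000 scale.
    return ( part % population ) === 0;
}
```

This also makes the assignment stable for a given session id, instead of
re-rolling on every call.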
On Mon, Sep 14, 2015 at 10:12 AM, Erik Bernhardson <ebernhardson(a)wikimedia.org> wrote:
Last week we started up a new AB test[1] comparing the existing completion
suggestions against a new completion suggestion API. This very simply puts
1 in 10000 users into the test bucket, and a further 1 in 10000 users into
the control bucket, like so:
- function oneIn( population ) {
-     return Math.floor( Math.random() * population ) === 0;
- }
- if ( oneIn( 10000 ) ) {
-     // test bucket
- } else if ( oneIn( 10000 ) ) {
-     // control bucket
- } else {
-     return; // rejected
- }
-
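As a side note on the sampling math above (just a sketch to make the rates
explicit, not part of the deployed code): because the second draw only
happens for users who failed the first, the control bucket's effective
rate is marginally below 1 in 10000.

```javascript
// Effective probabilities of the sequential oneIn( 10000 ) draws:
var p = 1 / 10000;
var pTest = p;                 // test bucket: 1/10000
var pControl = ( 1 - p ) * p;  // control bucket: (9999/10000) * (1/10000)
// pControl is about 0.01% smaller than pTest -- negligible here, but worth
// knowing when comparing bucket sizes.
```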
On every page load we generate a random 64-bit number via
`mw.user.generateRandomSessionId()`. This is used to correlate events
performed by the same user on the same page, and is logged with all our
events as event_pageId. In older tests (this was turned off September 3rd)
using this same event_pageId scheme, roughly 0.3% of event_pageId values
came from multiple IP addresses, which seems sane and normal:
- mysql:research@analytics-store.eqiad.wmnet [log]> select count,
count(count) from (select count(distinct clientIp) as count from
TestSearchSatisfaction_12423691 group by event_pageId) x group by
count;
- +-------+--------------+
- | count | count(count) |
- +-------+--------------+
- | 1 | 411104 |
- | 2 | 1500 |
- +-------+--------------+
- 2 rows in set (3.11 sec)
-
On the test we just started, though, we are seeing 48% of event_pageId
values being reported by multiple IP addresses. We can't find any way to
explain why this has changed so much, and as such are uncertain whether we
can rely on the other data collected by this test.
- mysql:research@analytics-store.eqiad.wmnet [log]> select count,
count(count) from (select count(distinct clientIp) as count from
CompletionSuggestions_13424343 group by event_pageId) x group by
count;
- +-------+--------------+
- | count | count(count) |
- +-------+--------------+
- | 1 | 1176 |
- | 2 | 243 |
- | 3 | 254 |
- | 4 | 212 |
- | 5 | 143 |
- | 6 | 102 |
- | 7 | 64 |
- | 8 | 36 |
- | 9 | 16 |
- | 10 | 14 |
- | 11 | 8 |
- | 12 | 5 |
- +-------+--------------+
- 12 rows in set (0.03 sec)
We have a third schema in production that has been collecting events the
entire time. It seems to have started showing this issue on September 10th,
which lines up with a Thursday train deployment:
mysql:research@analytics-store.eqiad.wmnet [log]> select date,
MAX(count) from (select substr(timestamp, 1, 8) as date, count(distinct
clientIp) as count from TestSearchSatisfaction2_13223897 group by
substr(timestamp, 1, 8), event_pageId) x group by date;
- +----------+------------+
- | date | MAX(count) |
- +----------+------------+
- | 20150902 | 1 |
- | 20150903 | 2 |
- | 20150904 | 2 |
- | 20150905 | 4 |
- | 20150906 | 3 |
- | 20150907 | 3 |
- | 20150908 | 3 |
- | 20150909 | 3 |
- | 20150910 | 11 |
- | 20150911 | 12 |
- | 20150912 | 14 |
- | 20150913 | 18 |
- | 20150914 | 13 |
- +----------+------------+
- 13 rows in set (1.74 sec)
Does anyone have any ideas for where this change could have come from?
[1]
https://gerrit.wikimedia.org/r/#/c/236937/1/modules/ext.wikimediaEvents.sea…
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics