After reviewing a week's worth of data for the common terms A/B test, we
have decided that we have not collected enough information. The initial
sampling was:
  * 1:1000 users chosen to participate in the test
  * Those users were split into 6 buckets, giving each bucket a 1:6000 sampling

This collected ~100 events per bucket, and far fewer in the "strict" bucket.
We are increasing the overall sampling by 5x, to 1:200. This will give each
bucket a 1:1200 sampling of users. The buckets collect so little data
because quite a few queries don't meet the minimum requirements to be
affected by the tests. The "aggressive recall" test requires at least 3
words in the query, and the "strict" test requires at least 6 words in the
query.
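
For anyone who wants to check the arithmetic, here is a rough sketch of the
sampling math and the word-count cutoffs described above. It is not the
actual test code: the function names are made up for illustration, the even
split across buckets is assumed, and counting words by splitting on
whitespace is an assumption about how the real tests count them.

```python
from fractions import Fraction

# Minimum query length (in words) for a query to be affected by each test.
# Only the two thresholds stated in this message are listed.
MIN_WORDS = {"aggressive recall": 3, "strict": 6}


def per_bucket_rate(overall_one_in: int, buckets: int) -> Fraction:
    """Share of all users landing in one bucket, assuming an even split."""
    return Fraction(1, overall_one_in) / buckets


def is_affected(query: str, bucket: str) -> bool:
    """True if the query has enough words for the given test to apply.

    Splitting on whitespace is an assumption, not the real tokenization.
    """
    return len(query.split()) >= MIN_WORDS[bucket]


old_rate = per_bucket_rate(1000, 6)  # initial sampling, 1:6000 per bucket
new_rate = per_bucket_rate(200, 6)   # increased sampling, 1:1200 per bucket
print(old_rate, new_rate, new_rate / old_rate)  # prints "1/6000 1/1200 5"
```

A three-word query passes the "aggressive recall" cutoff but not the
"strict" one, which is consistent with the "strict" bucket collecting the
fewest events.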
Erik B.