On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc <gilles@wikimedia.org> wrote:
1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses

Since the issue is the global load, I think it'd be resolved by changing the sampling rate for the large wikis only. The small ones going back to 1:1 would be fine, as they contribute little to the global load.

That solves part of the problem, but not all of it. For example, how do we display click-to-thumbnail time in Kenya on our map? Presumably most people there use the English or French Wikipedia, which are large ones, but the traffic from Kenya is small, so sampling will pretty much destroy it. Same for rare actions like clicking on the author name.

Basically we should find the segments which are large in all dimensions (e.g. thumbnail clicks on enwiki from the US), and only sample those.

Is there a way to set different PHP settings for small wikipedias than for large ones, though?

InitializeSettings.php can take wiki names directly, or any of the dblists from the operations/mediawiki-config repo root (s* and small/medium/large would be the helpful ones here).

- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))

I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.

Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10 and we take the plain average of what is left, that gives about 1.1 sec (12/11); our one sample from the second group really represents 10 events, but only has the weight of one. The same logic should hold for geometric means.
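To make the arithmetic concrete, here is a rough Python sketch of the 10-and-10 example. It assumes each logged event carries a weight N meaning "logged at 1:N, so it stands in for N real events"; the tuple layout and the function names are made up for illustration and are not taken from our logging schema.

```python
import math

# Hypothetical log rows: (duration_sec, weight), where weight N means the
# event was logged at 1:N and therefore represents N real events.
full_population = [1.0] * 10 + [2.0] * 10
sampled = [(1.0, 1)] * 10 + [(2.0, 10)]  # second group sampled 1:10


def weighted_mean(samples):
    """Arithmetic mean where each value counts weight times."""
    total_weight = sum(w for _, w in samples)
    return sum(v * w for v, w in samples) / total_weight


def weighted_geometric_mean(samples):
    """exp(sum(weight * ln(value)) / sum(weight))."""
    total_weight = sum(w for _, w in samples)
    return math.exp(sum(w * math.log(v) for v, w in samples) / total_weight)


true_mean = sum(full_population) / len(full_population)   # 1.5
naive_mean = sum(v for v, _ in sampled) / len(sampled)    # 12/11, about 1.09
naive_geo = math.exp(sum(math.log(v) for v, _ in sampled) / len(sampled))

print(true_mean, naive_mean, weighted_mean(sampled))
print(naive_geo, weighted_geometric_mean(sampled))
```

With the weights applied, both the arithmetic and the geometric mean come back to what the unsampled data would give (1.5 and sqrt(2)), while the unweighted versions are pulled towards the unsampled group.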

I think averages would be unaffected by uniform sampling; but we are not doing uniform sampling here; even if we are only doing per-wiki sampling, we might need to aggregate data from differently sampled groups for a cross-wiki comparison chart, for example.

(I suspect percentiles would be affected by non-uniform sampling as well, but I don't really have an idea how.)
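
One way to see the effect on percentiles, as a sketch only: the numbers below are invented, and the weighted percentile uses a simple nearest-rank rule on cumulative weight, which is one common definition and not necessarily what our charts would use. The weight again means "logged at 1:N".

```python
def weighted_percentile(samples, p):
    """Nearest-rank percentile over (value, weight) pairs.

    Returns the smallest value whose cumulative weight reaches
    p * total_weight, with p given as a fraction between 0 and 1.
    """
    ordered = sorted(samples)
    threshold = p * sum(w for _, w in ordered)
    cumulative = 0
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return ordered[-1][0]


# Invented data: 10 events at 1 sec logged 1:1, 30 events at 2 sec logged
# 1:10, so only 3 of the slow events survive sampling.
unsampled = [(1.0, 1)] * 10 + [(2.0, 1)] * 30
sampled_unweighted = [(1.0, 1)] * 10 + [(2.0, 1)] * 3
sampled_weighted = [(1.0, 1)] * 10 + [(2.0, 10)] * 3

median_true = weighted_percentile(unsampled, 0.5)
median_naive = weighted_percentile(sampled_unweighted, 0.5)
median_weighted = weighted_percentile(sampled_weighted, 0.5)
print(median_true, median_naive, median_weighted)
```

In this made-up case the true median is 2 sec, the unweighted median of the sampled data drops to 1 sec, and weighting by the sampling factor brings it back to 2 sec.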