I agree that there's no reason to re-weight the observations when the sampling rate is consistent. The only reason I might re-weight based on the sample would be if I were combining data with different sampling rates.
-Aaron
On Tue, May 20, 2014 at 8:18 AM, Nuria Ruiz nuria@wikimedia.org wrote:
[gerco] - whenever we display geometric means, we weight by sampling rate: exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value)))
[gilles] I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.
[gerco] Assume we have 10 duration logs with a 1 sec time and 10 with 2 sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10 and we take the average of that, that would give 1.1 sec; our one sample from the second group really represents 10 events, but only has the weight of one. The same logic should hold for geometric means.
What variable are we measuring with this data that we are averaging?
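To make the arithmetic concrete, here is a minimal sketch of gerco's example (hypothetical illustration, not code from any of our repos): ten events at 1 sec plus ten at 2 sec, with the second group sampled 1:10 so only one of its rows is logged.

    <?php
    // 10 events at 1 sec plus 10 at 2 sec; the true mean is 1.5 sec.
    // With the second group sampled 1:10, only one of its events reaches the log.
    $logged  = array_merge( array_fill( 0, 10, 1.0 ), [ 2.0 ] );
    $weights = array_merge( array_fill( 0, 10, 1 ), [ 10 ] ); // events each row represents

    // Naive mean over the logged rows: 12 / 11 ≈ 1.1 sec -- biased low.
    $naive = array_sum( $logged ) / count( $logged );

    // Weighting each row by its sampling factor recovers the true mean:
    // (10 * 1 + 10 * 2) / 20 = 1.5 sec.
    $weightedSum = 0.0;
    foreach ( $logged as $i => $v ) {
        $weightedSum += $weights[$i] * $v;
    }
    $weightedMean = $weightedSum / array_sum( $weights );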
On Mon, May 19, 2014 at 11:40 AM, Gergo Tisza gtisza@wikimedia.org wrote:
On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc gilles@wikimedia.org
wrote:
1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses
Since the issue is the global load, I think it'd be resolved by changing the sampling rate for the large wikis only. The small ones going back to 1:1 would be fine, as they contribute little to the global load.
That solves part of the problem, but not all of it. For example, how do we display click-to-thumbnail time in Kenya on our map? Presumably most people there use the English or French Wikipedia, which are large ones, but the traffic from Kenya is small, so sampling will pretty much destroy it. Same for rare actions like clicking on the author name.
Basically we should identify the segments which are large in all dimensions (e.g. thumbnail clicks on enwiki from the US), and only sample those.
Is there a way to set different PHP settings for small Wikipedias than for large ones, though?
InitializeSettings.php can take wiki names directly, or any of the dblists from the operations/mediawiki-config repo root (s* and small/medium/large would be the helpful ones here).
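For illustration, a hedged sketch of what a per-size sampling setting could look like there. The setting name 'wmgMediaViewerSamplingFactor' is invented for this example; only the dblist keys are real.

    <?php
    // Hypothetical fragment in the style of wmf-config/InitializeSettings.php;
    // keys may be wiki dbnames or dblist tags from the repo root.
    $wgConf->settings += [
        'wmgMediaViewerSamplingFactor' => [
            'default' => 1000, // keep 1:1000 on the big wikis
            'medium'  => 100,
            'small'   => 1,    // back to 1:1 where the load is negligible
        ],
    ];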
- whenever we display geometric means, we weight by sampling rate: exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value)))
I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.
Assume we have 10 duration logs with a 1 sec time and 10 with 2 sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10 and we take the average of that, that would give 1.1 sec; our one sample from the second group really represents 10 events, but only has the weight of one. The same logic should hold for geometric means.
I think averages would be unaffected by uniform sampling; but we are not doing uniform sampling here. Even if we are only doing per-wiki sampling, we might need to aggregate data from differently sampled groups for a cross-wiki comparison chart, for example.
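As a concrete sketch of that formula (a hypothetical helper, assuming each row carries the sampling factor of the group it came from):

    <?php
    // exp( sum( w * ln(v) ) / sum( w ) ), with w = each row's sampling factor.
    // When every weight is equal this reduces to exp( avg( ln(v) ) ),
    // the plain geometric mean.
    function weightedGeometricMean( array $values, array $samplingFactors ) {
        $num = 0.0;
        $den = 0.0;
        foreach ( $values as $i => $v ) {
            $num += $samplingFactors[$i] * log( $v );
            $den += $samplingFactors[$i];
        }
        return exp( $num / $den );
    }

    // Mixing a 1:1 row with a 1:10 row: ≈ 1.88 sec rather than sqrt(2) ≈ 1.41 sec.
    echo weightedGeometricMean( [ 1.0, 2.0 ], [ 1, 10 ] );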
(I suspect percentiles would be affected by non-uniform sampling as well, but I don't really have an idea how.)
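For reference, one standard approach (sketched here for illustration with a hypothetical helper; not something we run anywhere) is to accumulate weights along the sorted values:

    <?php
    // Weighted percentile sketch: sort rows by value, then walk the cumulative
    // weight until it passes the requested fraction p of the total weight.
    function weightedPercentile( array $values, array $weights, $p ) {
        array_multisort( $values, $weights ); // sort values ascending, keep weights aligned
        $total = array_sum( $weights );
        $cum = 0.0;
        foreach ( $values as $i => $v ) {
            $cum += $weights[$i];
            if ( $cum >= $p * $total ) {
                return $v;
            }
        }
        return end( $values );
    }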
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics