Re: [Multimedia] [Analytics] EventLogging ballooning

19 May 2014

On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc <gilles@wikimedia.org> wrote:
1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki
fullscreen button presses
Since the issue is the global load, I think it'd be resolved by changing
the sampling rate for the large wikis only. The small ones going back to
1:1 would be fine, as they contribute little to the global load.
That solves part of the problem, but not all of it. For example, how do we
display click-to-thumbnail time in Kenya on our map? Presumably most people
there use the English or French Wikipedia, which are large ones, but the
traffic from Kenya is small, so sampling will pretty much destroy it. Same for
rare actions like clicking on the author name.
Basically we should find the segments which are large in all dimensions (e.g.
thumbnail clicks on enwiki from US), and only sample those.
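
A rough sketch of that idea, assuming we had a list of which values count as "large" in each dimension: a segment is sampled only when every one of its dimensions is large, and everything else is logged 1:1. The `LARGE` table, the dimension names and the `choose_sampling_factor` helper are all made up for illustration; this is not how EventLogging is actually configured.

```python
# Hypothetical: values considered high-volume in each dimension.
LARGE = {
    "wiki": {"enwiki", "frwiki"},
    "action": {"thumbnail-click"},
    "country": {"US"},
}

def choose_sampling_factor(segment, large=LARGE, factor=1000):
    """Return N for 1:N sampling; 1 means every event is logged."""
    # Sample only when the segment is large in *all* dimensions.
    large_everywhere = all(segment[dim] in values for dim, values in large.items())
    return factor if large_everywhere else 1

us_clicks = {"wiki": "enwiki", "action": "thumbnail-click", "country": "US"}
kenya_clicks = {"wiki": "enwiki", "action": "thumbnail-click", "country": "KE"}

print(choose_sampling_factor(us_clicks))     # 1000
print(choose_sampling_factor(kenya_clicks))  # 1
```

With a rule like this the Kenyan traffic on enwiki stays unsampled even though enwiki itself is large.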
Is there a way to set different PHP settings for small wikipedias than for
large ones, though?
InitializeSettings.php can take wiki names directly, or any of the dblists
from the operations/mediawiki-config repo root (s* and small/medium/large
would be the helpful ones here).
- whenever we display geometric means, we weight by sampling rate
(exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
exp(avg(ln(value))))

I don't follow the logic here. Like percentiles, averages should be
unaffected by sampling, geometric or not.
Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the
(arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
take the average of that, that would give 1.1 sec; our one sample from the
second group really represents 10 events, but only has the weight of one.
The same logic should hold for geometric means.
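
The numbers in that example can be checked with a few lines of Python. This is only a sketch: each log is written as a (duration, N) pair, where N is the factor of the 1:N sampling applied to it, and the function names are mine, not anything from the dashboards.

```python
import math

# (duration in seconds, N of the 1:N sampling applied to that log).
# Group 1: ten 1-second logs, unsampled. Group 2: ten 2-second events
# sampled 1:10, so a single log survives and stands for ten events.
logs = [(1.0, 1)] * 10 + [(2.0, 10)]

def naive_mean(logs):
    # Ignores sampling: every surviving log counts once.
    return sum(v for v, _ in logs) / len(logs)

def weighted_mean(logs):
    # Each log counts as many times as the events it represents.
    return sum(w * v for v, w in logs) / sum(w for _, w in logs)

def weighted_geometric_mean(logs):
    # exp(sum(w * ln(value)) / sum(w))
    return math.exp(sum(w * math.log(v) for v, w in logs) / sum(w for _, w in logs))

print(round(naive_mean(logs), 2))               # 1.09
print(weighted_mean(logs))                      # 1.5
print(round(weighted_geometric_mean(logs), 3))  # 1.414
```

The weighted geometric mean comes out as sqrt(2), which is what the 20 unsampled events would have given.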
I think averages would be unaffected by *uniform* sampling; but we are not
doing uniform sampling here; even if we are only doing per-wiki sampling,
we might need to aggregate data from differently sampled groups for a
cross-wiki comparison chart, for example.
(I suspect percentiles would be affected by non-uniform sampling as well,
but I don't really have an idea how.)
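
One way to see the effect on percentiles, offered as a sketch and not as something we do today: treat each log as standing for N events (N being its 1:N sampling factor) and take the value at which the cumulative weight crosses the requested fraction. `weighted_percentile` is a hypothetical helper, and the data is the same 1 sec / 2 sec example as above.

```python
def weighted_percentile(logs, fraction):
    """Smallest value whose cumulative weight reaches `fraction` of the total."""
    ordered = sorted(logs)
    total = sum(w for _, w in ordered)
    cumulative = 0
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= fraction * total:
            return value
    return ordered[-1][0]

# Ten unsampled 1-second logs, plus one 2-second log sampled 1:10.
logs = [(1.0, 1)] * 10 + [(2.0, 10)]
# The same logs with the sampling factors ignored.
unweighted = [(v, 1) for v, _ in logs]

print(weighted_percentile(unweighted, 0.75))  # 1.0
print(weighted_percentile(logs, 0.75))        # 2.0
```

Ignoring the sampling factors puts the 75th percentile at 1 sec, while the 20 underlying events have it at 2 sec, so percentiles do shift under non-uniform sampling.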
