I agree that there's no reason to re-weight the observations under a consistent sample.  The only reason I might re-weight based on the sample would be if I were combining data with different sampling rates.
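
To make the mixed-rate case concrete, here is a minimal Python sketch of the example quoted below (10 events at 1 sec logged 1:1, 10 events at 2 sec sampled 1:10). It assumes that "sampling_rate" in the quoted formula means the N of a 1:N rate, so each logged row stands for N real events; the function names are mine for illustration, not anything from our dashboards.

```python
import math

def weighted_mean(values, weights):
    """Arithmetic mean where each value counts `weight` times."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def weighted_geometric_mean(values, weights):
    """exp(sum(w * ln(v)) / sum(w)), the weighted form quoted in the thread."""
    log_sum = sum(w * math.log(v) for v, w in zip(values, weights))
    return math.exp(log_sum / sum(weights))

# Logged rows: ten 1 sec events kept at 1:1, one 2 sec event kept at 1:10.
values = [1.0] * 10 + [2.0]
# Assumption: the weight is N for a 1:N sampling rate.
weights = [1] * 10 + [10]

naive_mean = sum(values) / len(values)              # 12/11, about 1.09
corrected_mean = weighted_mean(values, weights)     # 30/20 = 1.5
corrected_geo = weighted_geometric_mean(values, weights)  # sqrt(2), about 1.41

print(naive_mean, corrected_mean, corrected_geo)
```

If every row has the same weight, both functions reduce to the unweighted versions, which is why a consistent sample needs no re-weighting.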

-Aaron


On Tue, May 20, 2014 at 8:18 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
>>> [gergo] - whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))

>>[gilles] I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.

> [gergo] Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we take the average of that, that would give 1.1 sec; our one sample from the second group really represents 10 events, but only has the weight of one. The same logic should hold for geometric means.
What variable are we measuring with this data that we are averaging?



On Mon, May 19, 2014 at 11:40 AM, Gergo Tisza <gtisza@wikimedia.org> wrote:
> On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc <gilles@wikimedia.org> wrote:
>>>
>>> 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki
>>> fullscreen button presses
>>
>>
>> Since the issue is the global load, I think it'd be resolved by changing
>> the sampling rate for the large wikis only. The small ones going back to 1:1
>> would be fine, as they contribute little to the global load.
>
>
> That solves part of the problem, but not all of it. For example, how do we
> display click-to-thumbnail time in Kenya on our map? Presumably most people
> there use the English or French Wikipedia, which are large ones, but the
> traffic from Kenya is small, so sampling will pretty much destroy it. Same for
> rare actions like clicking on the author name.
>
> Basically we should find the segments which are large in all dimensions (e.g.
> thumbnail clicks on enwiki from US), and only sample those.
>
>> Is there a way to set different PHP settings for small wikipedias than for
>> large ones, though?
>
>
> InitializeSettings.php can take wiki names directly, or any of the dblists
> from the operations/mediawiki-config repo root (s* and small/medium/large
> would be the helpful ones here).
>
>>> - whenever we display geometric means, we weight by sampling rate
>>> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
>>> exp(avg(ln(value))))
>>
>>
>> I don't follow the logic here. Like percentiles, averages should be
>> unaffected by sampling, geometric or not.
>
>
> Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the
> (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
> take the average of that, that would give 1.1 sec; our one sample from the
> second group really represents 10 events, but only has the weight of one.
> The same logic should hold for geometric means.
>
> I think averages would be unaffected by uniform sampling; but we are not
> doing uniform sampling here; even if we are only doing per-wiki sampling, we
> might need to aggregate data from differently sampled groups for a
> cross-wiki comparison chart, for example.
>
> (I suspect percentiles would be affected by non-uniform sampling as well,
> but I don't really have an idea how.)
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
