> 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki
> fullscreen button presses.
Since the issue is the global load, I think it'd be resolved by changing
the sampling rate for the large wikis only. The small ones going back to
1:1 would be fine, as they contribute little to the global load. Is there a
way to set different PHP settings for small wikipedias than for large ones,
though?
> - whenever we display total counts, we use sum(sampling_rate) instead of
> count(*)
The query for actions is a bit more complex:
https://git.wikimedia.org/blob/analytics%2Fmultimedia.git/1fa576fabbf6598f0…
"THEN sampling_rate ELSE 0" should work, afaik.
> - whenever we display geometric means, we weight by sampling rate
> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> exp(avg(ln(value))))
I don't follow the logic here. Like percentiles, averages should be
unaffected by sampling, geometric or not.
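For reference, the weighting you propose would look like this in SQL
(made-up names again); as far as I can tell it only makes a difference
when sampling_rate actually varies between rows:

    -- Sampling-rate-weighted geometric mean of a duration column.
    -- With a uniform sampling_rate this reduces to exp(avg(ln(duration))).
    SELECT
        wiki,
        EXP(SUM(sampling_rate * LN(duration)) / SUM(sampling_rate))
            AS geometric_mean_duration_ms
    FROM MultimediaViewerDuration_log
    GROUP BY wiki;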
I'll go ahead and write changesets to add sampling_rate to the schemas and
Media Viewer's code, we're going to need that anyway.
On Sun, May 18, 2014 at 7:00 AM, Gergo Tisza <gtisza@wikimedia.org> wrote:
> On Fri, May 16, 2014 at 9:34 AM, Ori Livneh <ori@wikimedia.org> wrote:
>
>> On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) <nemowiki@gmail.com>
>> wrote:
>>
>>> * From 40 to 260 events logged per second in a month: what's going on?
>>
>>
>> Eep, thanks for raising the alarm. MediaViewer is 170 events /
>> sec, MultimediaViewerDuration is 38 / sec.
>>
>> +CC Multimedia.
>>
>
> After an IRC discussion we added 1:1000 sampling to both of those schemas.
> I'll need a little help fixing things on the data processing side; I'll
> give a short description of how we use the data first.
>
> A MediaViewer event represents a user action (e.g. clicking on a
> thumbnail, or using the back button in the browser while the lightbox is
> open). The most used actions are (were, before the sampling) logged a few
> million times a day; the least used ones less than a thousand times.
> We use the data to display graphs like this:
>
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab
> There are also per-wiki graphs; there are about three orders of magnitude
> of difference between the largest and the smallest wikis (it will be more
> once we roll out on English).
>
> A MultimediaViewerDuration event contains data about how much the user had
> to wait (such as milliseconds between clicking the thumbnail and displaying
> the image). This is fairly new and we don't have graphs yet, but they will
> look something like these (which show the latency of our network requests):
>
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#overall_network_perfor…
>
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_p…
> that is, they are used to calculate a geometric mean and various
> percentiles, with per-wiki and per-country breakdown.
>
> What I would like to understand is: 1) how we need to modify these charts
> to account for the sampling, 2) how we can make sure the sampling does not
> result in loss of low-volume data (e.g. from wikis which have less traffic).
>
> == How to take the sampling into account ==
>
> For the activity charts which show total event counts, this is easy: we
> just need to multiply the count by the sampling ratio.
>
> For percentile charts, my understanding is (thanks for the IRC advice,
> Nuria and Leila!) that they remain accurate as long as the sample is
> large enough; the best practice is to sample at least 1000 events per
> bucket (so 10,000 altogether if we are looking for the 90th percentile,
> which splits the data into ten buckets; 100,000 if we are looking for
> the 99th percentile, which splits it into a hundred; etc.).
>
> I'm still looking for an answer on what effect sampling has on geometric
> means.
>
> == How to handle data sources with very different volumes ==
>
> As I said above, there are about three orders of magnitude of difference
> in data volume between frequent and rare user actions, and also between
> large and small wikis (probably even more between countries - if you look
> at the map linked above, you can see that some African countries are
> missing: we use 1:1000 sampling and haven't collected a single data point
> there yet).
>
> So to get a proper amount of data, we would probably need to vary sampling
> per wiki or country, and also per action: 1:1000 sampling is fine for
> frwiki thumbnail clicks, but not for cawiki fullscreen button presses. The
> question is, how to mix different data sources? For example, we might
> decide to sample thumbnail clicks 1:1000 on enwiki but only 1:100 on
> dewiki, and then we want to show a graph of global clicks which includes
> both enwiki and dewiki counts.
>
> Here is what I came up with:
> - we add a "sampling rate" field to all our schemas
> - the rule that determines the sampling rate of a given event (i.e. the
> reciprocal of the probability of the event getting logged) can be as
> complex as we like, as long as the logging code saves that number as well
> - whenever we display total counts, we use sum(sampling_rate) instead of
> count(*)
> - whenever we display percentiles, we ignore sampling rates; they should
> not influence the result even if we consider data from multiple sources
> with mixed sampling rates (I'm not quite sure about this one)
> - whenever we display geometric means, we weight by sampling rate
> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> exp(avg(ln(value))))
> Do you think that would yield correct results?