1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses

Since the issue is the global load, I think it'd be resolved by changing the sampling rate for the large wikis only. Letting the small ones go back to 1:1 would be fine, as they contribute little to the global load. Is there a way, though, to set different PHP settings for small Wikipedias than for large ones?

- whenever we display total counts, we use sum(sampling_rate) instead of count(*)

The query for actions is a bit more complex: https://git.wikimedia.org/blob/analytics%2Fmultimedia.git/1fa576fabbf6598f064e4d05a59171a92bdd2033/actions%2Ftemplate.sql but "THEN sampling_rate ELSE 0" should work, afaik.
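
For the record, here's a rough sketch of what I mean by the weighted counts; the table and column names below are placeholders, not the real schema:

    -- Sketch only: table and column names are placeholders.
    -- Weighted total for a single action (replaces a plain conditional count):
    SELECT SUM(IF(event_action = 'thumbnail', event_samplingRate, 0)) AS thumbnail_total
    FROM MediaViewer_log;
    -- Weighted total across all events (replaces COUNT(*)):
    SELECT SUM(event_samplingRate) AS all_events_total
    FROM MediaViewer_log;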

- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))

I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.
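
To make sure we're talking about the same thing, here is how I read the two variants, sketched with placeholder table and column names:

    -- Sketch only: table and column names are placeholders.
    -- Unweighted geometric mean:
    SELECT EXP(AVG(LN(event_duration))) AS geo_mean
    FROM MultimediaViewerDuration_log;
    -- Geometric mean weighted by sampling rate, as proposed above:
    SELECT EXP(SUM(event_samplingRate * LN(event_duration)) / SUM(event_samplingRate)) AS weighted_geo_mean
    FROM MultimediaViewerDuration_log;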

I'll go ahead and write changesets to add sampling_rate to the schemas and to Media Viewer's code; we're going to need that anyway.

On Sun, May 18, 2014 at 7:00 AM, Gergo Tisza <gtisza@wikimedia.org> wrote:
On Fri, May 16, 2014 at 9:34 AM, Ori Livneh <ori@wikimedia.org> wrote:
On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
* From 40 to 260 events logged per second in a month: what's going on?

Eep, thanks for raising the alarm. MediaViewer is 170 events / sec, MultimediaViewerDuration is 38 / sec.

+CC Multimedia.

After an IRC discussion we added 1:1000 sampling to both of those schemas. I'll need a little help fixing things on the data processing side; I'll give a short description of how we use the data first.

A MediaViewer event represents a user action (e.g. clicking on a thumbnail, or using the back button in the browser while the lightbox is open). The most used actions are (were, before the sampling) logged a few million times a day; the least used ones less than a thousand times.
There are also per-wiki graphs; there are about three orders of magnitude of difference between the largest and the smallest wikis (it will be even more once we roll out to English Wikipedia).

A MultimediaViewerDuration event contains data about how much the user had to wait (such as the milliseconds between clicking the thumbnail and the image being displayed). This is fairly new and we don't have graphs yet, but they will look something like these (which show the latency of our network requests): that is, they are used to calculate a geometric mean and various percentiles, with per-wiki and per-country breakdowns.

What I would like to understand is: 1) how we need to modify these charts to account for the sampling, 2) how we can make sure the sampling does not result in loss of low-volume data (e.g. from wikis which have less traffic).

== How to take the sampling into account ==

For the activity charts which show total event counts, this is easy: we just need to multiply the count by the sampling ratio.

For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate as long as the sample is large enough; the best practice is to have at least 1,000 events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 for the 99th percentile, etc.).

I'm still looking for an answer on what effect sampling has on geometric means.

== How to handle data sources with very different volumes ==

As I said above, there are about three orders of magnitude of difference between the data volume for frequent and rare user actions, and also between large and small wikis (probably even more between countries; if you look at the map linked above, you can see that some African countries are missing: we use 1:1000 sampling and haven't collected a single data point there yet).

So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action: 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses. The question is, how to mix different data sources? For example, we might decide to sample thumbnail clicks 1:1000 on enwiki but only 1:100 on dewiki, and then we want to show a graph of global clicks which includes both enwiki and dewiki counts.

Here is what I came up with:
- we add a "sampling rate" field to all our schemas
- the rule for determining the sampling rate of a given event (i.e. the reciprocal of the probability of the event getting logged) can be as complex as we like, as long as the logging code saves that number as well
- whenever we display total counts, we use sum(sampling_rate) instead of count(*)
- whenever we display percentiles, we ignore sampling rates; they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))

Do you think that would yield correct results?

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics