For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate, as long as the amount sampled is large enough; the best practice is to sample at least 1000 events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if we are looking for the 99th percentile etc).
Correct, there is no adjustment needed in this case. We are just reducing the sample to the size we need to calculate a percentile with an acceptable level of confidence. This is a simplification that should work well in this case.
I'm still looking for an answer on what effect sampling has on geometric means.
If the sampling we have is good enough to calculate a 90th or 99th percentile (which it is), I do not see why you would need to adjust your geometric mean in any way. Anyone please correct me if I am wrong, but I believe that if you want a measure of how spread out your values are, you can calculate the geometric standard deviation.
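As a rough illustration of the geometric mean and geometric standard deviation mentioned here, a minimal Python sketch follows. The latency values are made up, and the population form of the standard deviation is an assumption of this sketch, not something the thread specifies.

```python
import math


def geometric_mean(values):
    """Geometric mean: exp of the arithmetic mean of the logarithms."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def geometric_std(values):
    """Geometric standard deviation: exp of the std. deviation of the logarithms.

    The result is a multiplicative factor (>= 1), not an additive spread.
    Uses the population variance (divides by n); this is an assumption.
    """
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    variance = sum((x - mean_log) ** 2 for x in logs) / len(logs)
    return math.exp(math.sqrt(variance))


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [100.0, 200.0, 400.0]
gm = geometric_mean(latencies_ms)
gsd = geometric_std(latencies_ms)
print(f"geometric mean: {gm:.1f} ms, geometric std: x{gsd:.2f}")
```

A geometric standard deviation close to 1 means the values cluster tightly around the geometric mean; larger factors mean a wider spread.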
So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action:
Correct. Every action you are comparing should have a sample size that lets you calculate, say, the 99th percentile with acceptable confidence. Per our rule above, that means 100,000 samples or more (this is, again, a simplification that should work well in this case).
Now, are you really interested in detailing user behavior of your feature per wiki? Is the expectation that users from es.wikipedia have a fundamentally different experience than users from fr.wikipedia? Or are we studying "global" usage? If we need different sample sizes per wiki, the most logical way to do it is to have a sampling configuration deployed per wiki rather than changing the schemas. (Need to check whether the MediaWiki config allows for this.)
- whenever we display percentiles, we ignore sampling rates, they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
This is only correct if you have a sufficient sample size in all datasets to calculate percentiles with acceptable confidence. Example (simplifying things a bunch to rules of thumb): you are interested in the 90th percentile and you have dataset 1 with 100,000 points, dataset 2 with 500,000 and dataset 3 with 1,000. You can compare the 90th percentile between datasets 1 and 2, but dataset 3 does not have enough data to calculate the 90th percentile.
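The rule of thumb and the three-dataset example can be sketched in Python as below. The function name `min_sample_size` and the default of 1000 events per bucket are taken from, or named after, the rule discussed in this thread; this is a simplification, not a formal confidence calculation.

```python
def min_sample_size(percentile, events_per_bucket=1000):
    """Smallest sample size under the thread's rule of thumb.

    The rule asks for about `events_per_bucket` events above the percentile,
    e.g. 1000 / (1 - 0.90) = 10,000 events for the 90th percentile.
    """
    if not 0 < percentile < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    tail_fraction = (100 - percentile) / 100
    return round(events_per_bucket / tail_fraction)


# Dataset sizes from the example above.
datasets = {"dataset 1": 100_000, "dataset 2": 500_000, "dataset 3": 1_000}

needed = min_sample_size(90)
comparable = [name for name, size in datasets.items() if size >= needed]
too_small = [name for name, size in datasets.items() if size < needed]

print(f"needed for p90: {needed}")
print(f"comparable: {comparable}")
print(f"too small: {too_small}")
```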
On Sun, May 18, 2014 at 7:00 AM, Gergo Tisza wrote:
On Fri, May 16, 2014 at 9:34 AM, Ori Livneh wrote:
On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) wrote:
- From 40 to 260 events logged per second in a month: what's going on?
Eep, thanks for raising the alarm. MediaViewer is 170 events / sec, MultimediaViewerDuration is 38 / sec.
+CC Multimedia.
After an IRC discussion we added 1:1000 sampling to both of those schemas. I'll need a little help fixing things on the data processing side; I'll give a short description of how we use the data first.
A MediaViewer event represents a user action (e.g. clicking on a thumbnail, or using the back button in the browser while the lightbox is open). The most used actions are (were, before the sampling) logged a few million times a day; the least used ones less than a thousand times. We use the data to display graphs like this: There are also per-wiki graphs; there are about three orders of magnitude of difference between the largest and the smallest wikis (it will be more once we roll out on English).
A MultimediaViewerDuration event contains data about how much the user had to wait (such as milliseconds between clicking the thumbnail and displaying the image). This is fairly new and we don't have graphs yet, but they will look something like these (which show the latency of our network requests): that is, they are used to calculate a geometric mean and various percentiles, with per-wiki and per-country breakdown.
What I would like to understand is: 1) how we need to modify these charts to account for the sampling, 2) how we can make sure the sampling does not result in loss of low-volume data (e.g. from wikis which have less traffic).
== How to take the sampling into account ==
For the activity charts which show total event counts, this is easy: we just need to multiply the count by the reciprocal of the sampling ratio (that is, by 1000 for 1:1000 sampling).
For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate, as long as the amount sampled is large enough; the best practice is to sample at least 1000 events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if we are looking for the 99th percentile etc).
I'm still looking for an answer on what effect sampling has on geometric means.
== How to handle data sources with very different volumes ==
As I said above, there are about three orders of magnitude of difference between data volume for frequent and rare user actions, and also between large and small wikis (probably even more for countries - if you look at the map linked above, you can see that some African countries are missing: we use 1:1000 sampling and haven't collected a single data point there yet).
So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action: 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses. The question is, how to mix different data sources? For example, we might decide to sample thumbnail clicks 1:1000 on enwiki but only 1:100 on dewiki, and then we want to show a graph of global clicks which includes both enwiki and dewiki counts.
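One way to picture per-wiki, per-action sampling is a lookup table keyed by wiki and action. The Python sketch below is purely illustrative: the table, the action identifiers, the rates, and the `maybe_log` helper are all hypothetical, and storing the rate on each logged event is an assumption made so that later aggregation can account for it.

```python
import random

# Hypothetical table: (wiki, action) -> N, meaning "log 1 in N events".
SAMPLING_RATES = {
    ("frwiki", "thumbnail-click"): 1000,  # high volume, sample heavily
    ("cawiki", "fullscreen"): 1,          # low volume, log everything
}
DEFAULT_RATE = 1000  # hypothetical fallback for unlisted combinations


def sampling_rate(wiki, action):
    """Look up N for this wiki and action, falling back to the default."""
    return SAMPLING_RATES.get((wiki, action), DEFAULT_RATE)


def maybe_log(wiki, action, rng=random):
    """Return the event to log, with its sampling rate attached, or None."""
    rate = sampling_rate(wiki, action)
    if rng.random() < 1 / rate:
        return {"wiki": wiki, "action": action, "sampling_rate": rate}
    return None


# With a rate of 1 every event is logged; with 1000, roughly 1 in 1000 is.
print(maybe_log("cawiki", "fullscreen"))
```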
Here is what I came up with:
- we add a "sampling rate" field to all our schemas
- the rule to determine the sampling rate of a given event (i.e. the reciprocal of the probability of the event getting logged) can be as complicated as we like, as long as the logging code saves that number as well
- whenever we display total counts, we use sum(sampling_rate) instead of the raw event count
- whenever we display percentiles, we ignore sampling rates, they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))
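The count and geometric mean rules in this list can be sketched in Python as follows. The events and their values are made up, and the field names `sampling_rate` and `value` are assumptions standing in for whatever the schemas would actually use.

```python
import math

# Made-up sampled events; sampling_rate is N for "logged 1 in N".
events = [
    {"wiki": "enwiki", "sampling_rate": 1000, "value": 100.0},
    {"wiki": "enwiki", "sampling_rate": 1000, "value": 100.0},
    {"wiki": "dewiki", "sampling_rate": 100, "value": 800.0},
]


def estimated_total(events):
    """Estimated true event count: sum(sampling_rate) over logged events."""
    return sum(e["sampling_rate"] for e in events)


def weighted_geometric_mean(events):
    """exp(sum(sampling_rate * ln(value)) / sum(sampling_rate))."""
    total_weight = sum(e["sampling_rate"] for e in events)
    weighted_logs = sum(e["sampling_rate"] * math.log(e["value"]) for e in events)
    return math.exp(weighted_logs / total_weight)


def unweighted_geometric_mean(events):
    """exp(avg(ln(value))), ignoring sampling rates."""
    return math.exp(sum(math.log(e["value"]) for e in events) / len(events))


total = estimated_total(events)
weighted = weighted_geometric_mean(events)
unweighted = unweighted_geometric_mean(events)
print(total, round(weighted, 1), round(unweighted, 1))
```

In this toy data the unweighted mean is pulled toward the dewiki value because dewiki events are over-represented in the sample relative to their true share; weighting by sampling rate compensates for that.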
Do you think that would yield correct results?
Analytics mailing list