For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate, as long as the amount sampled is large enough; the best practice is to sample at least 1000 events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if we are looking for the 99th percentile etc).
Correct, there is no adjustment needed in this case; we are just reducing the sample to the size we need to be able to calculate a percentile with an acceptable level of confidence. This is a simplification that should work well in this case.
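To illustrate why (a throwaway simulation, assuming numpy and a log-normal latency shape, both of which are just assumptions for the demo):

    import numpy as np

    rng = np.random.default_rng(42)

    # Fake "full" data: log-normal, a common shape for timing measurements.
    population = rng.lognormal(mean=6.0, sigma=1.0, size=10_000_000)

    # 1:1000 sampling, as now deployed on the MediaViewer schemas.
    sample = rng.choice(population, size=len(population) // 1000, replace=False)

    # With 10,000 sampled points, ~1,000 land above the 90th percentile
    # (the 1000-events-per-bucket rule), so the estimate tracks the truth.
    print(np.percentile(population, 90))  # "true" 90th percentile
    print(np.percentile(sample, 90))      # estimate from the sample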
I'm still looking for an answer on what effect sampling has on geometric means.
If the sampling we have is good enough to calculate a 90th or 99th percentile (which it is), I do not see why you would need to adjust your geometric mean in any way. Please, anyone, correct me if I am wrong, but I believe that if you want a measure of how spread out your values are, you can calculate the geometric standard deviation and find out.
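For what it's worth, both quantities are straightforward to compute from the sampled values alone; a minimal sketch (numpy assumed, the data is made up):

    import numpy as np

    def geometric_mean(values):
        # exp of the arithmetic mean of the logs.
        return np.exp(np.mean(np.log(values)))

    def geometric_stddev(values):
        # exp of the standard deviation of the logs; a multiplicative
        # spread factor (dimensionless, always >= 1).
        return np.exp(np.std(np.log(values)))

    latencies = [120, 340, 95, 800, 410]  # made-up sampled durations, in ms
    print(geometric_mean(latencies), geometric_stddev(latencies))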
So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action:
Correct. Every action you are inter-comparing should have a sample size that lets you calculate, say, a 99th percentile with acceptable confidence. Per our rule above, that means 100,000 samples or more (this is, again, a simplification that should work well in this case).
Now, are you really interested in detailing the user behavior of your feature per wiki? Is the expectation that users from es.wikipedia have a fundamentally different experience than users from fr.wikipedia? Or are we studying "global" usage? If we need different sample sizes per wiki, the most logical way to do it is to have a sampling configuration deployed per wiki rather than changing the schemas. (Need to check whether MediaWiki config allows for this.)
- whenever we display percentiles, we ignore sampling rates; they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
This is only correct if you have a sufficient sample size in all datasets to calculate percentiles with acceptable confidence. Example (simplifying things a bunch, down to rules of thumb): you are interested in the 90th percentile and you have dataset 1 with 100,000 points, dataset 2 with 500,000 and dataset 3 with 1,000. You can inter-compare the 90th percentile in datasets 1 and 2, but in dataset 3 there is not enough data to calculate the 90th percentile.
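That rule of thumb is easy to encode as a sanity check; a sketch (the function name and threshold are mine, purely illustrative):

    EVENTS_PER_BUCKET = 1000  # the rule-of-thumb minimum from this thread

    def enough_for_percentile(n_points, percentile):
        # For the p-th percentile, the tail bucket holds (100 - p)% of the
        # data; require at least EVENTS_PER_BUCKET points in that bucket.
        tail_fraction = (100 - percentile) / 100.0
        return n_points * tail_fraction >= EVENTS_PER_BUCKET

    print(enough_for_percentile(100_000, 90))  # True:  10,000 tail points
    print(enough_for_percentile(1_000, 90))    # False: only 100 tail points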
On Sun, May 18, 2014 at 7:00 AM, Gergo Tisza gtisza@wikimedia.org wrote:
On Fri, May 16, 2014 at 9:34 AM, Ori Livneh ori@wikimedia.org wrote:
On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
- From 40 to 260 events logged per second in a month: what's going on?
Eep, thanks for raising the alarm. MediaViewer is 170 events / sec, MultimediaViewerDuration is 38 / sec.
+CC Multimedia.
After an IRC discussion we added 1:1000 sampling to both of those schemas. I'll need a little help fixing things on the data processing side; I'll give a short description of how we use the data first.
A MediaViewer event represents a user action (e.g. clicking on a thumbnail, or using the back button in the browser while the lightbox is open). The most used actions were (before the sampling) logged a few million times a day; the least used ones, less than a thousand times. We use the data to display graphs like this: http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab There are also per-wiki graphs; there are about three orders of magnitude of difference between the largest and the smallest wikis (it will be more once we roll out on English).
A MultimediaViewerDuration event contains data about how much the user had to wait (such as milliseconds between clicking the thumbnail and displaying the image). This is fairly new and we don't have graphs yet, but they will look something like these (which show the latency of our network requests): http://multimedia-metrics.wmflabs.org/dashboards/mmv#overall_network_perform... http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_pe... that is, they are used to calculate a geometric mean and various percentiles, with per-wiki and per-country breakdown.
What I would like to understand is: 1) how we need to modify these charts to account for the sampling, 2) how we can make sure the sampling does not result in loss of low-volume data (e.g. from wikis which have less traffic).
== How to take the sampling into account ==
For the activity charts which show total event counts, this is easy: we just need to multiply the count by the sampling ratio.
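E.g. with made-up numbers, at 1:1000 sampling:

    SAMPLING_RATIO = 1000           # 1:1000 sampling

    logged_events = 1234            # rows actually present in the table
    estimated_total = logged_events * SAMPLING_RATIO
    print(estimated_total)          # ~1,234,000 real events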
For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate, as long as the amount sampled is large enough; the best practice is to sample at least 1000 events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if we are looking for the 99th percentile etc).
I'm still looking for an answer on what effect sampling has on geometric means.
== How to handle data sources with very different volumes ==
As I said above, there are about three orders of magnitude of difference between the data volumes for frequent and rare user actions, and also between large and small wikis (probably even more for countries: if you look at the map linked above, you can see that some African countries are missing; we use 1:1000 sampling and haven't collected a single data point there yet).
So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action: 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses. The question is, how to mix different data sources? For example, we might decide to sample thumbnail clicks 1:1000 on enwiki but only 1:100 on dewiki, and then we want to show a graph of global clicks which includes both enwiki and dewiki counts.
Here is what I came up with (a rough code sketch follows the list):
- we add a "sampling rate" field to all our schemas
- the rule to determine the sampling rate of a given event (i.e. the reciprocal of the probability of the event getting logged) can be as complicated as we like, as long as the logging code saves that number as well
- whenever we display total counts, we use sum(sampling_rate) instead of count(*)
- whenever we display percentiles, we ignore sampling rates; they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))
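Put together, the proposal would amount to something like this (a rough numpy sketch with made-up numbers, not code we actually run):

    import numpy as np

    # Each logged event carries its own sampling rate, e.g. 1000 for 1:1000.
    values = np.array([120.0, 340.0, 95.0, 800.0])  # made-up durations (ms)
    rates  = np.array([1000,  1000,  100,   100])   # per-event sampling rate

    # Total count: sum(sampling_rate) instead of count(*).
    total_events = rates.sum()

    # Percentiles: computed on the raw values, ignoring the rates.
    p90 = np.percentile(values, 90)

    # Geometric mean, weighted by sampling rate:
    # exp(sum(rate * ln(value)) / sum(rate)) instead of exp(avg(ln(value))).
    geo_mean = np.exp(np.sum(rates * np.log(values)) / rates.sum())

    print(total_events, p90, geo_mean)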
Do you think that would yield correct results?