1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses
Since the issue is the global load, I think it'd be resolved by changing the sampling rate for the large wikis only. The small ones going back to 1:1 would be fine, as they contribute little to the global load. Is there a way to use different PHP settings for small Wikipedias than for large ones, though?
- whenever we display total counts, we use sum(sampling_rate) instead of count(*)
The query for actions is a bit more complex (https://git.wikimedia.org/blob/analytics%2Fmultimedia.git/1fa576fabbf6598f06...); using "THEN sampling_rate ELSE 0" there should work, AFAIK.
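Roughly what I have in mind, once the field exists (a sketch only; "MediaViewer_log" and the column names are placeholders, not the actual EventLogging schema):

  -- sketch only: estimated daily counts per action, weighting each logged row
  -- by its sampling rate; table and column names are placeholders
  SELECT
      LEFT(timestamp, 8) AS day,
      SUM(CASE WHEN action = 'thumbnail'  THEN sampling_rate ELSE 0 END) AS thumbnail_clicks,
      SUM(CASE WHEN action = 'fullscreen' THEN sampling_rate ELSE 0 END) AS fullscreen_presses
  FROM MediaViewer_log
  GROUP BY day
  ORDER BY day;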
- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))
I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.
I'll go ahead and write changesets to add sampling_rate to the schemas and to Media Viewer's code; we're going to need that anyway.
On Sun, May 18, 2014 at 7:00 AM, Gergo Tisza gtisza@wikimedia.org wrote:
On Fri, May 16, 2014 at 9:34 AM, Ori Livneh ori@wikimedia.org wrote:
On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
- From 40 to 260 events logged per second in a month: what's going on?
Eep, thanks for raising the alarm. MediaViewer is 170 events / sec, MultimediaViewerDuration is 38 / sec.
+CC Multimedia.
After an IRC discussion we added 1:1000 sampling to both of those schemas. I'll need a little help fixing things on the data processing side; I'll give a short description of how we use the data first.
A MediaViewer event represents a user action (e.g. clicking on a thumbnail, or using the back button in the browser while the lightbox is open). The most used actions are (were, before the sampling) logged a few million times a day; the least used ones less than a thousand times. We use the data to display graphs like this: http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab
There are also per-wiki graphs; there are about three orders of magnitude of difference between the largest and the smallest wikis (it will be more once we roll out on English).
A MultimediaViewerDuration event contains data about how much the user had to wait (such as milliseconds between clicking the thumbnail and displaying the image). This is fairly new and we don't have graphs yet, but they will look something like these (which show the latency of our network requests):
http://multimedia-metrics.wmflabs.org/dashboards/mmv#overall_network_perform...
http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_pe...
That is, they are used to calculate a geometric mean and various percentiles, with per-wiki and per-country breakdowns.
What I would like to understand is: 1) how we need to modify these charts to account for the sampling, 2) how we can make sure the sampling does not result in loss of low-volume data (e.g. from wikis which have less traffic).
== How to take the sampling into account ==
For the activity charts which show total event counts, this is easy: we just need to multiply the count by the sampling ratio.
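Concretely, something like this (a sketch only; the table name is a placeholder and the factor is the current flat 1:1000 rate):

  -- sketch only: with a flat 1:1000 sample, scale the raw count back up;
  -- "MediaViewer_log" is a placeholder table name
  SELECT
      LEFT(timestamp, 8) AS day,
      COUNT(*) * 1000 AS estimated_events
  FROM MediaViewer_log
  GROUP BY day
  ORDER BY day;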
For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate as long as the sample is large enough; the best practice is to have at least 1,000 sampled events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if we are looking for the 99th percentile, and so on).
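For reference, one rough way to get a nearest-rank 90th percentile straight from the sampled rows in MySQL (sketch only, placeholder names; GROUP_CONCAT is capped by group_concat_max_len, so this is only practical for modest row counts):

  -- sketch only: nearest-rank 90th percentile computed directly on the sampled
  -- rows, with no reweighting needed
  SELECT
      SUBSTRING_INDEX(
          SUBSTRING_INDEX(
              GROUP_CONCAT(duration_ms ORDER BY duration_ms),
              ',', CEIL(0.9 * COUNT(*))),
          ',', -1) AS p90_duration_ms
  FROM MultimediaViewerDuration_log;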
I'm still looking for an answer on what effect sampling has on geometric means.
== How to handle data sources with very different volumes ==
As I said above, there are about three orders of magnitude of difference between the data volumes of frequent and rare user actions, and also between large and small wikis (probably even more between countries: if you look at the map linked above, you can see that some African countries are missing entirely; with 1:1000 sampling we haven't collected a single data point from them yet).
So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action: 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses. The question is, how to mix different data sources? For example, we might decide to sample thumbnail clicks 1:1000 on enwiki but only 1:100 on dewiki, and then we want to show a graph of global clicks which includes both enwiki and dewiki counts.
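To make the mixing concrete, the combined graph could be fed by something like the query below, assuming the per-event sampling rate field proposed in the list that follows (table and column names are placeholders):

  -- sketch only: global estimated thumbnail clicks per day across wikis sampled
  -- at different rates (e.g. 1:1000 on enwiki, 1:100 on dewiki); each logged row
  -- stands for sampling_rate real events, so summing the rates gives the estimate
  SELECT
      LEFT(timestamp, 8) AS day,
      SUM(sampling_rate) AS estimated_thumbnail_clicks
  FROM MediaViewer_log
  WHERE action = 'thumbnail'
    AND wiki IN ('enwiki', 'dewiki')
  GROUP BY day
  ORDER BY day;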
Here is what I came up with:
- we add a "sampling rate" field to all our schemas
- the rule to determine the sampling rate of a given event (i.e. the reciprocal of the probability of the event getting logged) can be as complex as we like, as long as the logging code saves that number as well
- whenever we display total counts, we use sum(sampling_rate) instead of count(*)
- whenever we display percentiles, we ignore sampling rates; they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
- whenever we display geometric means, we weight by sampling rate: exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))) (see the query sketch after this list)
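As a query, the weighted geometric mean from the last bullet would look roughly like this (sketch only; table and column names are placeholders):

  -- sketch only: sampling-rate-weighted geometric mean of a duration column,
  -- i.e. exp(sum(w * ln(x)) / sum(w)) with w = sampling_rate; rows with
  -- duration_ms <= 0 are excluded because ln() is undefined for them
  SELECT
      EXP(SUM(sampling_rate * LN(duration_ms)) / SUM(sampling_rate)) AS weighted_geometric_mean_ms
  FROM MultimediaViewerDuration_log
  WHERE duration_ms > 0;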
Do you think that would yield correct results?
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics