On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
- From 40 to 260 events logged per second in a month: what's going on?
Eep, thanks for raising the alarm. MediaViewer is 170 events / sec, MultimediaViewerDuration is 38 / sec.
+CC Multimedia.
Confirmations needed: https://www.mediawiki.org/?diff=1006822&oldid=964133
I am not sure if this is the info you are looking for but just in case, what was said on this thread back in March still stands: http://lists.wikimedia.org/pipermail/analytics/2014-March/001681.html
Some newer information on the sanitization front:
We are seriously thinking about implementing an incognito mode. Incognito mode will be "on" by default if you browse with cookies off. That is, if your browser is set to not make use of cookies, no data will be sampled by EL. This idea seems to be gaining ground and will probably turn into a project soon.
Regarding anonymization: after much discussion we believe that to properly anonymize EL data there is no other solution than aggregation. Recent work on this front includes pumping EL data into Kafka, from which we can pump it into Hadoop. There the data will go through an ETL process to be sanitized. Our current plan is to discard the original raw logs once the data is sanitized.
Note that IPs are sanitized in EL and always have been. That is not the case for user agents, which are stored raw.
Let us know if you are looking for other info besides the one provided here.
Thanks,
Nuria
On Fri, May 16, 2014 at 9:34 AM, Ori Livneh ori@wikimedia.org wrote:
On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
- From 40 to 260 events logged per second in a month: what's going on?
Eep, thanks for raising the alarm. MediaViewer is 170 events / sec, MultimediaViewerDuration is 38 / sec.
+CC Multimedia.
We log one duration event and a bunch of user action events (clicked on button X etc) per lightbox view. Is that a problem? On what front? (Logging infrastructure performance? Privacy? Client-side performance?) Expectations on how to use EventLogging should be clearly documented somewhere.
I can add sampling to these events, if this is a serious problem; although logging several events per page view is a pretty standard thing for analytics tools AFAIK. (Also if there are performance issues, maybe EventLogging should provide a batch logging option?)
Gergo,
Can we reduce the logging rate?
Every event is a row in the database, as EL is backed by a db. So 170 events per sec means a lot of rows being created.
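(For scale, a quick back-of-the-envelope calculation: 170 events/s × 86,400 s/day is roughly 14.7 million MediaViewer rows per day, and the combined ~260 events/s across all schemas works out to about 22.5 million rows per day.)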
Thanks,
Nuria
On Fri, May 16, 2014 at 10:03 AM, Gergo Tisza gtisza@wikimedia.org wrote:
We log one duration event and a bunch of user action events (clicked on button X etc) per lightbox view. Is that a problem? On what front? (Logging infrastructure performance? Privacy? Client-side performance?) Expectations on how to use EventLogging should be clearly documented somewhere.
I can add sampling to these events, if this is a serious problem; although logging several events per page view is a pretty standard thing for analytics tools AFAIK. (Also if there are performance issues, maybe EventLogging should provide a batch logging option?)
Hey Gergő, all valid points. Don't worry, you're not in trouble :) If you have a moment, can you log on IRC (#wikimedia-analytics) and help us coordinate a fix?
After an IRC discussion we added 1:1000 sampling to both of those schemas. I'll need a little help fixing things on the data processing side; I'll give a short description of how we use the data first.
A MediaViewer event represents a user action (e.g. clicking on a thumbnail, or using the back button in the browser while the lightbox is open). The most used actions are (were, before the sampling) logged a few million times a day; the least used ones less than a thousand times. We use the data to display graphs like this: http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab There are also per-wiki graphs; there are about three orders of magnitude of difference between the largest and the smallest wikis (it will be more once we roll out on English).
A MultimediaViewerDuration event contains data about how much the user had to wait (such as milliseconds between clicking the thumbnail and displaying the image). This is fairly new and we don't have graphs yet, but they will look something like these (which show the latency of our network requests): http://multimedia-metrics.wmflabs.org/dashboards/mmv#overall_network_perform... http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_pe... that is, they are used to calculate a geometric mean and various percentiles, with per-wiki and per-country breakdown.
What I would like to understand is: 1) how we need to modify these charts to account for the sampling, 2) how we can make sure the sampling does not result in loss of low-volume data (e.g. from wikis which have less traffic).
== How to take the sampling into account ==
For the activity charts which show total event counts, this is easy: we just need to multiply the count by the sampling ratio.
For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate, as long as the amount sampled is large enough; the best practice is to sample at least 1000 events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if we are looking for the 99th percentile etc).
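A quick way to sanity-check that intuition (illustrative Python only, with a made-up log-normal "latency" population; nothing here touches EventLogging):

    import random

    random.seed(42)
    # Fake population of 1,000,000 load times in ms (the log-normal shape is invented).
    population = [random.lognormvariate(6.0, 0.7) for _ in range(1_000_000)]

    def percentile(values, p):
        ordered = sorted(values)
        return ordered[int(round(p / 100.0 * (len(ordered) - 1)))]

    # 1:100 Bernoulli sampling keeps ~10,000 events, i.e. ~1,000 above the 90th
    # percentile, in line with the 1000-events-per-bucket rule of thumb above.
    sample = [x for x in population if random.random() < 1 / 100.0]

    print("population p90:", round(percentile(population, 90)))
    print("sample p90:    ", round(percentile(sample, 90)))

The two printed values come out very close, which is the point: with enough events left after sampling, the percentile estimate needs no correction.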
I'm still looking for an answer on what effect sampling has on geometric means.
== How to handle data sources with very different volumes ==
As I said above, there are about three orders of magnitude of difference between the data volume for frequent and rare user actions, and also between large and small wikis (probably even more for countries - if you look at the map linked above, you can see that some African countries are missing: we use 1:1000 sampling and haven't collected a single data point there yet).
So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action: 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses. The question is, how to mix different data sources? For example, we might decide to sample thumbnail clicks 1:1000 on enwiki but only 1:100 on dewiki, and then we want to show a graph of global clicks which includes both enwiki and dewiki counts.
Here is what I came up with:
- we add a "sampling rate" field to all our schemas
- the rule to determine the sampling rate of a given event (i.e. the reciprocal of the probability of the event getting logged) can be as complex as we like, as long as the logging code saves that number as well
- whenever we display total counts, we use sum(sampling_rate) instead of count(*)
- whenever we display percentiles, we ignore sampling rates; they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))
Do you think that would yield correct results?
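To make the proposed estimators concrete, a minimal sketch (illustrative Python; the events and their numbers are invented, and in practice the computation would be the SQL expressions above run against the EventLogging tables):

    import math

    # Each logged event carries the measured value plus the proposed
    # sampling_rate field (the reciprocal of its logging probability).
    events = [
        (1200.0, 1000),  # duration in ms, logged at 1:1000 (e.g. a large wiki)
        (900.0, 1000),
        (1500.0, 100),   # logged at 1:100 (e.g. a smaller wiki)
    ]

    # Estimated real event count: sum(sampling_rate) instead of count(*).
    estimated_total = sum(rate for _, rate in events)

    # Sampling-rate-weighted geometric mean:
    # exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)).
    weighted_geo_mean = math.exp(
        sum(rate * math.log(value) for value, rate in events)
        / sum(rate for _, rate in events)
    )

    print(estimated_total)            # 2100 real events represented by 3 logged rows
    print(round(weighted_geo_mean))   # weighted geometric mean of the durations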
1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses
Since the issue is the global load, I think it'd be resolved by changing the sampling rate for the large wikis only. The small ones going back to 1:1 would be fine, as they contribute little to the global load. Is there a way to set different PHP settings for small wikipedias than for large ones, though?
- whenever we display total counts, we use sum(sampling_rate) instead of count(*)
The query for actions is a bit more complex: https://git.wikimedia.org/blob/analytics%2Fmultimedia.git/1fa576fabbf6598f06... "THEN sampling_rate ELSE 0" should work, afaik.
- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))
I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.
I'll go ahead and write changesets to add sampling_rate to the schemas and Media Viewer's code, we're going to need that anyway.
On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc gilles@wikimedia.org wrote:
1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki fullscreen button presses
Since the issue is the global load, I think it'd be resolved by changing the sampling rate for the large wikis only. The small ones going back to 1:1 would be fine, as they contribute little to the global load.
That solves part of the problem, but not all of it. For example, how do we display click-to-thumbnail time in Kenya on our map? Presumably most people there use the English or French Wikipedia, which are large ones, but the traffic from Kenya is small, so sampling will pretty much destroy it. Same for rare actions like clicking on the author name.
Basically we should identify the segments which are large in all dimensions (e.g. thumbnail clicks on enwiki from the US), and only sample those.
Is there a way to set different PHP settings for small wikipedias than for large ones, though?
InitializeSettings.php can take wiki names directly, or any of the dblists from the operations/mediawiki-config repo root (s* and small/medium/large would be the helpful ones here).
- whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))
I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.
Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we take the average of that, that would give 1.1 sec; our one sample from the second group really represents 10 events, but only has the weight of one. The same logic should hold for geometric means.
I think averages would be unaffected by *uniform* sampling; but we are not doing uniform sampling here; even if we are only doing per-wiki sampling, we might need to aggregate data from differently sampled groups for a cross-wiki comparison chart, for example.
(I suspect percentiles would be affected by non-uniform sampling as well, but I don't really have an idea how.)
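Spelling out the toy numbers above (illustrative Python only; the durations and rates are exactly the ones in the example):

    # True population: 10 durations of 1 s and 10 of 2 s -> arithmetic mean 1.5 s.
    population = [1.0] * 10 + [2.0] * 10
    true_mean = sum(population) / len(population)                 # 1.5

    # The 1 s group is logged 1:1, the 2 s group 1:10, so the logged rows are:
    logged = [(1.0, 1)] * 10 + [(2.0, 10)]                        # (value, sampling_rate)

    naive_mean = sum(v for v, _ in logged) / len(logged)          # 12 / 11 ~= 1.09 s, the "1.1 sec" above
    weighted_mean = sum(v * w for v, w in logged) / sum(w for _, w in logged)  # 30 / 20 = 1.5 s

    print(true_mean, round(naive_mean, 2), weighted_mean)

The sampling-rate-weighted mean recovers the true 1.5 s; the unweighted one does not, because the two groups were sampled at different rates.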
[gergo] - whenever we display geometric means, we weight by sampling rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of exp(avg(ln(value))))
[gilles] I don't follow the logic here. Like percentiles, averages should be unaffected by sampling, geometric or not.
[gergo] Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we take the average of that, that would give 1.1 sec; our one sample from the second group really represents 10 events, but only has the weight of one. The same logic should hold for geometric means.
What variable are we measuring with this data that we are averaging?
I agree that there's no reason to re-weight the observations under a consistent sample. The only reason I might re-weight based on the sample would be if I were combining data with different sampling rates.
-Aaron
I agree that there's no reason to re-weight the observations under a consistent sample. The only reason I might re-weight based on the sample would be if I were combining data with different sampling rates.
The problem is, if we change sampling rates dynamically as we did on Friday, we are forced into a mixed sampling rate world. I think this deserves some real focus.
The problem is, if we change sampling rates dynamically as we did on Friday, we are forced into a mixed sampling rate world.
This is only relevant if we are using sampled data to calculate absolute counts. For percentiles and means it doesn't matter, as long as you are not using two datasets sampled at different ratios to calculate, say, an average.
On Tue, May 20, 2014 at 6:18 AM, Nuria Ruiz nuria@wikimedia.org wrote:
What variable are we measuring with this data that we are averaging?
The duration log shows the total time it takes for the viewer to load itself and the image data (milliseconds between clicking on the thumbnail and displaying the image). We want to sample this on large wikis since it generates a lot of data. We want to not sample this on small wikis since they generate very little data and the sampling would make it unreliable.
We want to display average loading time for each wiki, as decisions to enable/disable by default on that wiki should be informed by that stat (some wikis can have very different loading times due to network geography). We also want to display global average loading time, which is an average of all the logged loading times (which, per above, use different sampling). We might even want to display per-country loading times, which is an even more random mix of data from different wikis.
The duration log shows
I think you're focusing too much on the duration log which isn't graphed yet. Implementing graphs for that data has been constantly postponed in our cycle planning because it's been considered lower priority than the rest. We can focus on challenges specific to that data whenever it gets picked up.
We also want to display global average loading time, which is an average of all the logged loading times (which, per above, use different sampling). We might even want to display per-country loading times, which is an even more random mix of data from different wikis.
Having every graph and metric possible isn't necessarily a useful goal. Specific graphs are only worth having if they provide actionable conclusions that can't be found by looking at other graphs. For example, not being able to generate global graphs isn't that big a deal if we can draw the same conclusions they would provide by looking at the graphs of very large wikis. An entertaining graph isn't necessarily useful.
At this point the action log is the only one likely to have mixed sampling, but we only use that one for totals, not averages/percentiles. The only metrics we're displaying averages and percentiles for have consistent sampling across all wikis. Even for the duration log, there is consistent sampling at the moment, and it's so similar to the other sampled metrics we currently have that I don't foresee the need to introduce mixed sampling.
As for adapting the consistent sampling we currently have on our sampled logs to improve the accuracy of metrics for small countries/small wikis where the sample size is too small: is it really useful? Are we likely to find that increasing the accuracy of the measurement of a specific metric in a given African country will tell us something we don't already know? There's plenty of useful data on metrics with decent sample sizes; I think that trying to increase the sample size of each small metric for each small country is a little futile.
[gergo] We also want to display global average loading time, which is an average of all the logged loading times (which, per above, use different sampling).
[gilles] Having every graph and metric possible isn't necessarily a useful goal. Specific graphs are only worth having if they provide actionable conclusions that can't be found by looking at other graphs.
Agreed. I was about to send Gergo a response along the same lines. I think a graph of "global average loading time" is not very useful. The main point of graphing for performance is to "check" the health of the system and provide "actionable" data. A global metric like the one you are describing provides neither in this case. It would be a poor measure of the overall health of the system, as it does not closely represent the user experience of either warm-cache or cold-cache users. And it does not provide clear actionable data, as it is too much of a bird's-eye view of the system. You would need to drill into the percentile data per wiki to find actionable items.
[gergo] We might even want to display per-country loading times,
[gilles] There's plenty of useful data on metrics with decent sample sizes, I think that trying to increase the sample size of each small metric for each small country is a little futile.
Also agree here. If there is a true use case for which we need this information we can work on it, but let's not drown ourselves in data; the initial per-wiki percentile graphs are likely to provide many actionable points.
For percentile charts, my understanding is (thanks for the IRC advice, Nuria and Leila!) that they remain accurate, as long as the amount sampled is large enough; the best practice is to sample at least 1000 events per bucket (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if we are looking for the 99th percentile etc).
Correct, there is no adjustment needed in this case; we are just reducing the sample to the size we need to be able to calculate a percentile with an acceptable level of confidence. This is a simplification that should work well in this case.
I'm still looking for an answer on what effect sampling has on geometric means.
If the sampling we have is good enough to calculate a 90th or 99th percentile (which it is), I do not see why you would need to adjust your geometric mean in any way. Please, anyone, correct me if I am wrong, but I believe that if you want a measure of how spread out your values are, you can calculate the geometric standard deviation and find out.
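For reference, both are cheap to compute from the same logged values (illustrative Python; the durations are made up):

    import math

    durations = [850.0, 1200.0, 640.0, 2100.0, 980.0]   # ms, invented sample

    logs = [math.log(d) for d in durations]
    mean_log = sum(logs) / len(logs)
    geo_mean = math.exp(mean_log)

    # Geometric standard deviation: exp of the standard deviation of the logs.
    # Roughly, "values typically fall within a factor of geo_sd of the geometric mean".
    geo_sd = math.exp(math.sqrt(sum((x - mean_log) ** 2 for x in logs) / len(logs)))

    print(round(geo_mean), round(geo_sd, 2))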
So to get a proper amount of data, we would probably need to vary sampling per wiki or country, and also per action:
Correct. Every action you are inter-comparing should have a sample size that lets you calculate, say, a 99th percentile with acceptable confidence. Per our rule above, that means 100,000 samples or more (this is, again, a simplification that should work well in this case).
Now, are you really interested in detailing user behavior of your feature per wiki? Is the expectation that users from es.wikipedia have a fundamentally different experience than users from fr.wikipedia? Or are we studying "global" usage? If we need different sample sizes per wiki, the most logical way to do it is to have a sampling configuration deployed per wiki rather than changing the schemas. (Need to check whether mediawiki config allows for this.)
- whenever we display percentiles, we ignore sampling rates, they should not influence the result even if we consider data from multiple sources with mixed sampling rates (I'm not quite sure about this one)
This is only correct if you have a sufficient sample size in all datasets to calculate percentiles with acceptable confidence. Example (simplifying things a bunch to rules of thumb): you are interested in the 90th percentile and you have dataset 1 with 100,000 points, dataset 2 with 500,000 and dataset 3 with 1,000. You can inter-compare the 90th percentile in datasets 1 and 2, but in dataset 3 there is not enough data to calculate the 90th percentile.
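A tiny helper expressing that rule of thumb (illustrative Python; the 1,000-events-per-bucket threshold is the heuristic from this thread, not a hard statistical bound):

    def enough_for_percentile(n_logged, percentile, per_bucket=1000):
        # Rule of thumb from this thread: keep at least `per_bucket` events
        # above the percentile of interest.
        tail_fraction = 1.0 - percentile / 100.0     # 0.10 for p90, 0.01 for p99
        return n_logged * tail_fraction >= per_bucket

    for n in (100_000, 500_000, 1_000):              # the three datasets in the example
        print(n, enough_for_percentile(n, 90))
    # 100,000 -> True, 500,000 -> True, 1,000 -> False (only ~100 events above p90)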