Hi guys,
Does anyone know why the Media Viewer metrics dashboards seem to be stuck with old data from Friday?
http://multimedia-metrics.wmflabs.org/dashboards/mmv
Is there anything we could fiddle with to get the new data to show up?
Thanks for any insights :)
Fabrice
_______________________________
Fabrice Florin Product Manager Wikimedia Foundation
Media Viewer's usage of EventLogging grew considerably because of all the tracking we're doing: http://lists.wikimedia.org/pipermail/analytics/2014-May/002053.html and Nuria asked us to reduce the rate.
Due to the global scale we're dealing with, instead of logging every action on every site, we now have to measure a sample and extrapolate an estimate. As a quick fix last Friday, Gergo introduced sampling of actions (one in every thousand actions is now recorded, instead of every action). As a result, all figures on the actions graph were divided by 1000 overnight, making the line appear to go to 0. If you hover over recent days and look at the left sidebar, you'll see that there are figures (they are kind of useless, though; more on that below).
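For readers unfamiliar with how this kind of sampling works, here is a minimal sketch (hypothetical names, not the actual MultimediaViewer/EventLogging code): each client rolls a die before logging an action, and the analysis side multiplies the sampled counts back up by the sampling factor.

```python
import random

SAMPLING_FACTOR = 1000  # the quick fix's ratio: log 1 in every 1000 actions

def should_log(sampling_factor=SAMPLING_FACTOR):
    """Client-side coin flip: returns True roughly once per `sampling_factor`
    calls, so only ~1/sampling_factor of actions produce a logged event."""
    return random.random() * sampling_factor < 1

def extrapolate(sampled_count, sampling_factor=SAMPLING_FACTOR):
    """Analysis-side estimate of the true count from a sampled count."""
    return sampled_count * sampling_factor
```

For example, a day that records 812 sampled actions would be reported as an estimated `extrapolate(812) == 812000` real actions; the graphs currently plot the raw sampled counts, which is why the line appears to drop to zero.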
We're now working on improvements and fixing the graphs: https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/619 The general gist of it is that the figures will be compensated according to the sampling, and that the sampling factor will be fine-tuned to apply only to the metrics that were responsible for the high traffic.
Unfortunately, it looks like the 1:1000 sampling in place since last Friday was too extreme and is destructive of information, even for the most numerous actions. We knew that such a high sampling factor was going to destroy information for small wikis or metrics with low figures, but even the huge metrics in the millions have become unreliable. I say that because multiplying even the largest figures by 1000 still doesn't give an estimate close to what it was before the change. This means the actions graph probably won't be fixable for the period from last Friday until my fixes make it through: even compensating for the sampling (by multiplying the figures by 1000), the line would jump up and down every day for each metric.
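To illustrate why multiplying back by 1000 still jumps around, here is a small simulation (the numbers are illustrative, not real Media Viewer figures): even a perfectly stable metric of 200,000 daily events, sampled at 1:1000, yields rescaled daily estimates that swing by tens of thousands, because only ~200 events survive sampling each day.

```python
import random

random.seed(0)  # reproducible demo

TRUE_DAILY_COUNT = 200_000  # a hypothetical, perfectly stable per-wiki metric
SAMPLING_FACTOR = 1000

def sampled_estimate(true_count, factor):
    """Simulate one day of 1:factor sampling, then compensate by rescaling."""
    kept = sum(1 for _ in range(true_count) if random.random() * factor < 1)
    return kept * factor

# Seven days of the same true count still produce visibly different estimates:
# the rescaled count has a standard deviation of roughly sqrt(200) * 1000 ≈ 14,000.
week = [sampled_estimate(TRUE_DAILY_COUNT, SAMPLING_FACTOR) for _ in range(7)]
```

That spread is about 7% of the true value even for this mid-sized metric; for smaller wikis the relative swing is far worse, which matches what the graphs show.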
Graphs other than actions are unaffected (they were already sampled). The duration log was also affected, but that one doesn't have graphs yet, as the task to create them has been given low priority in the cycle.
On Mon, May 19, 2014 at 8:43 PM, Fabrice Florin <fflorin@wikimedia.org> wrote:
Multimedia mailing list Multimedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/multimedia
Dear Gilles,
Thanks so much for your helpful explanation of what’s causing this issue.
I am glad that you are on top of it. Please let us know what we can do to support you.
Realistically, when do you think we could get metrics dashboards updated again?
We have a big release on Thursday to English, German, Italian and Russian Wikipedias, and it would be best if they could be working by then, so we can track the impact of this major deployment.
Fabrice
Realistically, when do you think we could get metrics dashboards updated again?
I can't make predictions at this point, as my changesets are quite big and haven't been reviewed yet. I should have a better idea once the first pass of review has happened.
On Tue, May 20, 2014 at 6:30 PM, Fabrice Florin <fflorin@wikimedia.org> wrote:
On Tue, May 20, 2014 at 5:21 AM, Gilles Dubuc gilles@wikimedia.org wrote:
Even compensating for the sampling (by multiplying the figures by 1000), the line would jump up and down every day for each metric.
There is a big spike every weekend in the unsampled logs as well, so the numbers jumping around between Friday and now is not necessarily a sampling artifact.
Still, the sampling ratio was chosen aggressively and could be decreased if needed:
10:46 < ori> operationally i can tell you that 1:1000 and even 1:100 are totally fine
Is there a "scientific" way of choosing the right sampling? Like set a certain standard deviation we should be aiming for, and then work backwards from that?
Nuria already said that for percentiles we want 1,000 events per bucket, which means 100,000 events daily for a 99th-percentile graph (the highest percentile we currently have). We were getting ~3M duration log events a day, so the conservative choice would be 1:10, after which MultimediaViewerDuration logs would account for ~1% of the EventLogging traffic.
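Spelling out that arithmetic (a sketch; the 1,000-events-per-bucket rule of thumb is Nuria's, the helper names are mine): a 99th-percentile graph puts 1% of events in its top bucket, so keeping 1,000 events there needs 100,000 events a day, and ~3M daily events would tolerate up to 1:30 sampling, leaving 1:10 a 3x safety margin.

```python
def min_daily_events(percentile, events_per_bucket=1000):
    """Daily events needed so the top bucket of a percentile graph still
    holds `events_per_bucket` events (the 99th percentile's bucket gets 1%)."""
    top_bucket_share = 1 - percentile / 100.0
    return events_per_bucket / top_bucket_share

def max_sampling_factor(daily_events, percentile, events_per_bucket=1000):
    """Most aggressive 1:N sampling that still fills the top bucket."""
    return daily_events / min_daily_events(percentile, events_per_bucket)
```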
From action events, we were getting about 15M a day, and we only use them to show total counts (daily number of clicks, etc.). How do we tell when the sampling ratio is right for that?
There is a big spike every weekend in the unsampled logs as well, so the numbers jumping around between Friday and now is not necessarily a sampling artifact.
Look at the figures closely; they're ridiculous. French Wikipedia image views, which have been very stable lately, supposedly doubled over the weekend. Dutch Wikipedia image views, which had been steadily declining since launch, would also have tripled overnight.
Quoting Dan:
The load wasn't too much of a problem.
How do we tell when the sampling ratio is right for that?
I think you're overthinking it; you seem to be looking for the perfect figure. Let's start with an educated guess on the side of the spectrum that is less likely to have us lose data (which is what I've done for my config changeset), even if it means we're likely to overuse EventLogging. Then we'll see what we have and readjust accordingly, until we have both accurate data and reasonable EventLogging usage. There's no point trying to get it perfect the first time; it's more urgent to have accurate data again, and then we'll reduce the usage wherever we can without compromising the accuracy.
On Tue, May 20, 2014 at 8:03 PM, Gergo Tisza gtisza@wikimedia.org wrote:
[gergo] From action events, we were getting about 15M a day, and we only use them to show total counts (daily number of clicks, etc.). How do we tell when the sampling ratio is right for that?
[gilles] I think you're overthinking it; you seem to be looking for the perfect figure. Let's start with an educated guess
Right. What I have done in the past for situations similar to this one is to log heavily at the beginning to get a grasp of the volume of data (we have already done this, if not intentionally). Then, after you reduce the rate somewhat, gather data for some time and see at what level of sampling the data is no longer erratic (i.e. absolute values, when multiplied by the sampling rate, do not oscillate too much). We just need a configuration file that throttles the client logging; from the code changes I saw flying by on Friday, I gather we can also modify this sampling config pretty easily in MediaWiki code.
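That trial-and-error procedure can also be sketched analytically (the helper names and the 5% stability threshold below are my assumptions, not anything agreed in this thread): a 1:N sample of n daily events keeps about n/N of them, the rescaled count oscillates with a relative standard deviation of roughly 1/sqrt(n/N), and we pick the most aggressive rate that keeps the oscillation below the threshold, per metric and per wiki.

```python
import math

def relative_noise(daily_events, sampling_factor):
    """Approximate day-to-day relative jitter of a rescaled sampled count:
    a 1:N sample keeps ~daily_events/N events, with relative sd ~1/sqrt(kept)."""
    kept = daily_events / sampling_factor
    return 1 / math.sqrt(kept)

def pick_sampling_factor(daily_events, max_noise=0.05,
                         candidates=(1000, 100, 10, 1)):
    """Most aggressive candidate rate whose rescaled estimates stay stable."""
    for factor in candidates:  # ordered from most to least aggressive
        if relative_noise(daily_events, factor) <= max_noise:
            return factor
    return 1
```

This reproduces both observations in the thread: the ~15M-a-day aggregate tolerates 1:1000 (under 1% jitter, consistent with ori's operational comment), while a single wiki with 100,000 daily image views jumps by ±10% at 1:1000 and only settles at 1:100, which is why the factor needs per-metric fine-tuning.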
On Wed, May 21, 2014 at 10:46 AM, Gilles Dubuc gilles@wikimedia.org wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics