Right. What I have done in the past in situations similar to this one
is to log heavily at the beginning to get a grasp of the volume of
data (we have already done this, even if not intentionally). After you
reduce the rate somewhat, gather data for some time and see at what
level of sampling the data is no longer erratic (i.e. the absolute
values, once multiplied by the sampling factor, do not oscillate too
much). We just need a configuration file that throttles the client
logging; from the code changes that I saw flying by on Friday, I take
it that we can also modify this sampling config pretty easily in the
MediaWiki code.
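
A minimal sketch of that check, in Python. Everything in it is an assumption made for illustration: the function names, the 10% tolerance and the daily counts are invented, not taken from our logs. It scales the sampled counts back up and compares them with the figures from the unsampled period.

```python
from statistics import mean

def scale_up(sampled_counts, sampling_factor):
    """At 1:N sampling, multiply each sampled daily count by N."""
    return [count * sampling_factor for count in sampled_counts]

def max_relative_deviation(estimates, baseline):
    """Largest relative gap between any daily estimate and the baseline mean."""
    reference = mean(baseline)
    return max(abs(e - reference) / reference for e in estimates)

def looks_stable(sampled_counts, sampling_factor, baseline, tolerance=0.10):
    """True if every scaled-up day stays within `tolerance` of the baseline."""
    estimates = scale_up(sampled_counts, sampling_factor)
    return max_relative_deviation(estimates, baseline) <= tolerance

# Made-up numbers, for illustration only.
baseline = [15_000_000, 15_200_000, 14_900_000]  # unsampled daily counts
steady = [15_100, 14_950, 15_050]                # sampled 1:1000, well behaved
erratic = [15_100, 31_000, 9_800]                # sampled 1:1000, jumping around

print(looks_stable(steady, 1000, baseline))
print(looks_stable(erratic, 1000, baseline))
```

The baseline should cover comparable days of the week, otherwise an ordinary weekend spike will look like a sampling problem.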
On Wed, May 21, 2014 at 10:46 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
> There is a big spike every weekend in the unsampled logs as well, so the
> numbers jumping around between Friday and now is not necessarily a sampling
> artifact.
Look at the figures closely, they're ridiculous. French Wikipedia image
views, which have been very stable lately, supposedly doubled over the
weekend. Dutch Wikipedia image views, which had been steadily declining
since launch, would also have tripled overnight.
Quoting Dan:
The load wasn't too much of a problem.
> How do we tell when the sampling ratio is right for that?
I think you're overthinking it, you seem to be looking for the perfect
figure. Let's start with an educated guess on the side of the spectrum that
is less likely to have us lose data (which is what I've done for my config
changeset), even if it means we are likely to overuse EventLogging. Then
we'll see what we have and readjust accordingly, until we have both accurate
data and reasonable EventLogging usage. There's no point trying to get it
perfect the first time, it's more urgent to have accurate data again and
then we'll reduce the usage wherever we can without compromising the
accuracy.
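
For the educated guess itself, here is a rough sketch of working backwards from a target error, as asked in the quoted message below. It rests on an assumption that may not match how the client really samples: each event is kept independently with probability 1/N, so the sampled count is binomial and the scaled-up total has a relative standard error of sqrt((N - 1) / daily_events). The function names are made up, and reading "1000 events per bucket" as 100 buckets for a 99th percentile graph is my interpretation of the figures in the thread.

```python
import math

def relative_standard_error(daily_events, sampling_factor):
    """Relative standard error of the scaled-up daily total at 1:N sampling.

    Assumes each event is sampled independently with probability p = 1/N.
    """
    p = 1.0 / sampling_factor
    return math.sqrt((1.0 - p) / (daily_events * p))

def max_sampling_factor(daily_events, target_rse):
    """Largest 1:N factor whose relative standard error stays near the target.

    From rse^2 = (N - 1) / daily_events we get N = 1 + daily_events * rse^2.
    """
    return math.floor(1 + daily_events * target_rse ** 2)

def max_factor_for_percentile(daily_events, events_per_bucket, percentile):
    """Largest 1:N factor that still leaves enough events per percentile bucket."""
    buckets = 100 // (100 - percentile)   # 99th percentile -> 100 buckets
    needed = events_per_bucket * buckets  # e.g. 1000 * 100 = 100,000 a day
    return daily_events // needed

# Figures quoted in the thread: ~15M action events and ~3M duration events a day.
print(relative_standard_error(15_000_000, 1000))
print(max_sampling_factor(15_000_000, 0.01))
print(max_factor_for_percentile(3_000_000, 1000, 99))

# A hypothetical small wiki with 20,000 events a day, to see how it fares at 1:1000.
print(relative_standard_error(20_000, 1000))
```

This only models the noise added by sampling. It says nothing about real swings in traffic, and if events are sampled per session or per client rather than one by one, the actual error will be larger than this estimate.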
On Tue, May 20, 2014 at 8:03 PM, Gergo Tisza <gtisza(a)wikimedia.org> wrote:
>
> On Tue, May 20, 2014 at 5:21 AM, Gilles Dubuc <gilles(a)wikimedia.org>
> wrote:
>>
>> Unfortunately it looks like the 1:1000 sampling since last Friday was too
>> extreme and is destructive of information, even for the actions that were
>> the most numerous. We knew that such a high sampling factor was going to
>> destroy information for small wikis or metrics with low figures, but even
>> the huge metrics in the millions have become unreliable. I'm saying that
>> because multiplying even the largest figures by 1000 still doesn't give an
>> estimate close to what it was before the change. Which means that the
>> actions graph probably won't be fixable for the period since last Friday
>> until my fixes make it through. Even compensating for the sampling (by
>> multiplying the figures by 1000), the line would jump up and down every day
>> for each metric.
>
>
> There is a big spike every weekend in the unsampled logs as well, so the
> numbers jumping around between Friday and now is not necessarily a sampling
> artifact.
>
> Still, the sampling ratio was chosen aggressively and could be decreased
> if needed:
>>
>> 10:46 < ori> operationally i can tell you that 1:1000 and even 1:100 are
>> totally fine
>
>
> Is there a "scientific" way of choosing the right sampling? Like set a
> certain standard deviation we should be aiming for, and then work backwards
> from that?
>
> Nuria already said that for percentiles we want 1000 events per bucket,
> which means 100,000 events daily for a 99th percentile graph (that's the
> highest we have currently). We were getting ~3M duration log events a day,
> so the conservative choice would be 1:10, after which
> MultimediaViewerDuration logs would account for ~1% of the EventLogging
> traffic.
>
> From action events, we were getting about 15M a day, and we only use them
> to show total counts (daily number of clicks etc). How do we tell when the
> sampling ratio is right for that?
>
> _______________________________________________
> Multimedia mailing list
> Multimedia(a)lists.wikimedia.org
>
> https://lists.wikimedia.org/mailman/listinfo/multimedia
>
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics