Re: [Multimedia] [Analytics] Filtering out outliers in data used to generate tsvs

20 Apr 2014

> Many images still take a much longer time to load in practice, as reported
> by beta users around the world

Anecdotal evidence doesn't invalidate data collected directly by people's
web browsers. People's impressions aren't as reliable as the data we're
measuring. The reason we're collecting data this way is so that we can
separate the facts from the feelings people might have. Since we're talking
about an average, there are undeniably slower loads for certain people
(soon to be shown as histograms), but I don't see any reason to doubt the
collected averages based on people's comments.
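
To illustrate how a single slow load affects each kind of average, here is a
minimal Python sketch. The sample values are made up for illustration and are
not taken from our data, and the function names are hypothetical. It mirrors
the EXP(AVG(LOG(...))) and EXP(STDDEV(LOG(...))) expressions quoted further
down, and uses the population standard deviation because that is what MySQL's
STDDEV() computes.

```python
import math
import statistics


def geometric_mean(values):
    """exp(mean(log(x))), the same shape as EXP(AVG(LOG(x))) in SQL."""
    return math.exp(statistics.fmean([math.log(v) for v in values]))


def geometric_stddev(values):
    """exp(pstdev(log(x))); a multiplicative factor, not a duration."""
    return math.exp(statistics.pstdev([math.log(v) for v in values]))


# Hypothetical load times in ms: nine ordinary loads and one very slow one.
load_times_ms = [400, 450, 500, 550, 600, 650, 700, 750, 800, 60000]

arithmetic = statistics.fmean(load_times_ms)
geometric = geometric_mean(load_times_ms)
spread = geometric_stddev(load_times_ms)

print(f"arithmetic mean:  {arithmetic:.0f} ms")
print(f"geometric mean:   {geometric:.0f} ms")
print(f"geometric stddev: x{spread:.2f}")
```

With these made-up numbers the arithmetic mean is pulled far above every
ordinary load, while the geometric mean stays close to them. The slow load is
still real, which is why the histograms matter.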

For every dozen people who felt the need to comment that it was slow for
them, there could have been hundreds or thousands who were satisfied and
didn't say a thing. In my experience, people who are happy with or unaffected
by something are a lot less likely to engage with a feedback survey.

> Can we really assume that the mean image load time in India is 691
> milliseconds?

Yes, that data is very real. For the API map, India's figures are
calculated over 12,209 measured requests from 5,158 unique IP addresses,
none of which have bot-like user agent strings.

> But could there also be some bots or other traffic which could be
> distorting the results?

Bots are a valid concern, so I did some digging. Some bots masquerade as
real browsers (though not the serious search engines like Google or Yahoo,
which make up most of the bot traffic). Since we're not seeing any
non-masquerading bots at all in the India data, I seriously doubt there is
any bot traffic at this time that would impact the results for that country.

Looking at all countries, I only see 10 hits from a googlebot user agent
string, but with so few hits it's hard to say if it really is a
googlebot (and not someone/something pretending to be it...). In fact,
given the low bandwidth on those particular hits (24kb/s on an image load
that was a varnish hit) and the fact that their IPs appeared to come from
Poland and Bangladesh, I doubt it was really Google.
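
A user agent check of this kind can be sketched roughly as follows. This is an
illustrative sketch, not the actual check that was run: the pattern list, the
function names and the sample hits are all assumptions. By construction it
only catches bots that identify themselves, not the ones masquerading as
browsers.

```python
import re

# Hypothetical patterns; a real list would be longer and need maintenance.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def looks_like_bot(user_agent):
    """True when the user agent string self-identifies as a bot."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def split_hits(hits):
    """Partition (user_agent, load_time_ms) pairs into (humans, bots)."""
    humans, bots = [], []
    for hit in hits:
        (bots if looks_like_bot(hit[0]) else humans).append(hit)
    return humans, bots


# Made-up sample hits, for illustration only.
sample_hits = [
    ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", 120),
    ("Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0", 691),
]
humans, bots = split_hits(sample_hits)
print(len(humans), "browser hits,", len(bots), "self-identified bot hits")
```

Hits flagged this way still deserve a second look at bandwidth and IP origin,
since a user agent string is trivial to fake in either direction.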

While it's undeniable that rural areas one might visit during travels still
suffer from low internet speed, the majority of the world's population now
lives in cities:
http://www.un.org/en/development/desa/population/publications/urbanization/…
The average broadband speed worldwide is probably much higher nowadays
than most people think:
http://www.netindex.com/
And dial-up is rapidly disappearing:
http://www.pewinternet.org/data-trend/internet-use/connection-type/
Slow internet speed is a reality for a lot of people, but not for the
majority. I'm not surprised by the average results we're seeing. I agree
that this rapid change in recent years can be counter-intuitive when you're
used to traveling to rural locations.

> Any practical recommendations for addressing this concern?

Can the users who've been complaining about speed be contacted? That would
allow us to verify whether the bad experience is consistent for them. We
could measure it directly and even compare it to their general internet
speed.

As far as performance and stats improvements are concerned, we've been over
it several times and I think everything that could be done is already
implemented, filed or on its way.

And let's not forget that the status quo (opening the File: page) might be
just as slow for those people. They might just not realize it, because for
most of the time spent loading that page you are looking at a blank tab. Now
that the "versus" test has been running on cloudbees for a couple of days,
targeting mediawiki.org, we can see that the file page is slower on average:
http://multimedia-metrics.wmflabs.org/dashboards/mmv#media_viewer_vs_file_p…
That wasn't the case a couple of weeks back, but we've made a number of
improvements since.

That's why I think it's important to do some real measurements with the
users who bring up this issue. If we're not already doing it, we should
encourage them to optionally enter their email address so that we can
investigate issues further.

On Sat, Apr 19, 2014 at 9:44 PM, Fabrice Florin <fflorin(a)wikimedia.org> wrote:

> Thanks to everyone for this great teamwork!
>
> The updated geographical performance dashboards which Gilles and Mark just
> posted paint a more optimistic picture than before, which is encouraging:
>
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_p…
>
> However, these extremely fast load times do not match what we are hearing
> from our users -- or even our own experience on slower connections. Many
> images still take a much longer time to load in practice, as reported by
> beta users around the world, from Brazil to Hungary.
>
> Can we really assume that the mean image load time in India is 691
> milliseconds? Seems way too fast, based on my experience traveling in Asia
> a few weeks ago -- where images could take a very long time to load, if at
> all.
>
> As Gergo pointed out, these early results may be because our first beta
> testers may have some faster connections than average users. But could
> there also be some bots or other traffic which could be distorting the
> results?
>
> I know that we are working next on histograms that will give us a better
> sense of how outliers are performing against average users. Can't wait for
> that.
>
> But I am still concerned that this chart may be painting a much rosier
> picture than what's actually going on in the real world.
>
> Any practical recommendations for addressing this concern? We want to know
> what's really happening for average users, so we can determine whether or
> not regions with slow connections like India should consider making this
> feature opt-in, rather than opt-out.
>
> Thanks again to you all for helping us gain more clarity on this critical
> issue :)
>
> Fabrice
>
> On Apr 18, 2014, at 11:16 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
>
> Mark deployed the change, the mean and standard deviation on the "Overall
> network performance" and "Geographical network performance" tabs are now
> geometric:
>
> http://multimedia-metrics.wmflabs.org/dashboards/mmv
>
> These charts and maps now make a lot more sense! Next I'll be working on
> distribution histograms, so that we can see the outlier values that are now
> excluded from those graphs.
>
> Thanks again Aaron, thanks to you these visualizations have become truly
> useful and meaningful, in the way they were meant to be.
>
> On Thu, Apr 17, 2014 at 6:13 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
>
>> Yikes!  Good catch.
>>
>> On Thu, Apr 17, 2014 at 11:12 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
>>
>>> A solution to this problem is to generate a geometric mean[2] instead.
>>>
>>> Thanks a lot for the help, it literally instantly solved my problem!
>>>
>>> There was a small mistake in the order of functions in your example, for
>>> the record it should be:
>>>
>>> EXP(AVG(LOG(event_total))) AS geometric_mean
>>>
>>> And conveniently the geometric standard deviation can be calculated the
>>> same way:
>>>
>>> EXP(STDDEV(LOG(event_total))) AS geometric_stddev
>>>
>>> I put it to the test on a specific set of data where we had a huge
>>> outlier, and for that data it seems equivalent to excluding the lower and
>>> upper 10 percentiles, which is exactly what I was after.
>>>
>>> On Wed, Apr 16, 2014 at 4:24 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
>>>
>>>> Hi Gilles,
>>>>
>>>> I think I know just the thing you're looking for.
>>>>
>>>> It turns out that much of this performance data is log-normally
>>>> distributed[1].  Log-normal distributions tend to have a hockey stick
>>>> shape where most of the values are close to zero, but occasionally very
>>>> large values appear[3].  The mean of a log-normal distribution tends
>>>> to be sensitive to outliers like the ones you describe.
>>>>
>>>> A solution to this problem is to generate a geometric mean[2] instead.
>>>> One convenient thing about log-normal data is that if you log() it, it
>>>> becomes normal[4] -- and not sensitive to outliers in the usual way.
>>>> Also convenient, geometric means are super easy to generate.  All you
>>>> need to do is this: (1) pass all of the data through log() (2) pass the
>>>> same data through mean() (or avg() -- whatever) (3) pass the result
>>>> through exp().  The best thing about this is that you can do it in MySQL.
>>>>
>>>> For example:
>>>>
>>>> SELECT
>>>>   country,
>>>>   mean(timings) AS regular_mean,
>>>>   exp(log(mean(timings)) AS geomteric_mean
>>>> FROM log.WhateverSchemaYouveGot
>>>> GROUP BY country
>>>>
>>>> 1. https://en.wikipedia.org/wiki/Log-normal_distribution
>>>> 2. https://en.wikipedia.org/wiki/Geometric_mean
>>>> 3. See distribution.log_normal.svg (24K)
>>>> 4. See distribution.log_normal.logged.svg (33K)
>>>>
>>>> -Aaron
>>>>
>>>> On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
>>>>
>>>>>>> So, my latest idea for a solution is to write a python script that
>>>>>>> will import the section (last X days) of data from the EventLogging
>>>>>>> tables that we're interested in into a temporary sqlite database,
>>>>>>> then proceed with removing the upper and lower percentiles of the
>>>>>>> data, according to any column grouping that might be necessary. And
>>>>>>> finally, once the data preprocessing is done in sqlite, run similar
>>>>>>> queries as before to export the mean, standard deviation, etc. for
>>>>>>> given metrics to tsvs. I think using sqlite is cleaner than doing the
>>>>>>> preprocessing on db1047 anyway.
>>>>>>>
>>>>>>> It's quite an undertaking, it basically means rewriting all our
>>>>>>> current SQL => TSV conversion. The ability to use more steps in the
>>>>>>> conversion means that we'd be able to have simpler, more readable SQL
>>>>>>> queries. It would also be a good opportunity to clean up the giant
>>>>>>> performance query with a bazillion JOINS:
>>>>>>> https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e…
>>>>>>> can actually be divided into several data sources all used in the
>>>>>>> same graph.
>>>>>>>
>>>>>>> Does that sound like a good idea, or is there a simpler solution out
>>>>>>> there that someone can think of?
>>>>>
>>>>> Well, I think this sounds like we need to seriously evaluate how
>>>>> people are using EventLogging data and provide this sort of analysis as
>>>>> a feature.  We'd have to hear from more people but I bet it's the right
>>>>> thing to do long term.
>>>>>
>>>>> Meanwhile, "simple" is highly subjective here.  If it was me, I'd
>>>>> clean up the indentation of that giant SQL query you have, then maybe
>>>>> figure out some ways to make it faster, then be happy as a clam.  So if
>>>>> sql-lite is the tool you feel happy as a clam with, then that sounds
>>>>> like a great solution.  Alternatives would be python, php, etc.  I
>>>>> forgot if pandas was allowed where you're working but that's a great
>>>>> python library that would make what you're talking about fairly easy.
>>>>>
>>>>> Another thing for us to seriously consider is PostgreSQL.  This has
>>>>> proper f-ing temporary tables and supports actual people doing actual
>>>>> work with databases.  We could dump data, especially really simple
>>>>> schemas like EventLogging, into PostgreSQL for analysis.
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> Analytics(a)lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> _______________________________________________
> Multimedia mailing list
> Multimedia(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/multimedia
>
> _______________________________
> Fabrice Florin
> Product Manager
> Wikimedia Foundation
> http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)