Thanks for the detailed response, Gilles!
I appreciate your willingness to keep in mind reports from users alongside the image load
data we are collecting.
As you suggest, I will ask legal if we can collect email addresses of users who are
willing to be contacted for follow up questions, so we can dig in a bit more about their
performance issues.
I too would rather rely on actual data than anecdotal reports, but I want to make sure
that the data is reliable. My own experience continues to show long load times that take
seconds, not just milliseconds, on pages like these:
For the purposes of calculating total image load from your dashboards, should we still be
adding the API and image performance numbers? That would bring our different data points a
bit closer to each other. :)
I look forward to learning more together about our average users' actual experience,
which may require us to calibrate results from different methods until we have a good
handle on this.
Onward!
Fabrice
On Apr 21, 2014, at 2:33 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
Are the stats reliable though? There is a huge jump a
few days ago, even in the file page loading times. Is that when it was switched over to
Cloudbees?
Any data on that graph before March 18th is junk that came from (often partial) runs on
my laptop, at times on internet connections of very questionable quality.
March 18th onwards is exclusively run on cloudbees. You can see right away that those
cloudbees figures are a lot more stable.
When we have more data in a few days I'll update the SQL query to remove the
misleading figures that came from local development. In fact we should make sure to avoid
running this test locally against
mediawiki.org or any production wiki where EventLogging
is turned on from now on, otherwise we'll pollute the stats.
On Mon, Apr 21, 2014 at 3:38 AM, Gergo Tisza <gtisza(a)wikimedia.org> wrote:
On Sun, Apr 20, 2014 at 3:39 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
Any practical recommendations for addressing this concern?
Can the users who've been complaining about speed be contacted? That would allow us
to verify whether the bad experience is consistent for them; we could measure it directly
and even compare it to their general internet speed.
I started a separate thread about that; will also reach out to the users on hu.wiki.
Asking for email addresses in the survey would also be good, but we should check if it has
legal implications (collecting private data can be, especially in the EU, a painful
process).
And let's not forget that the status quo (opening the File: page) might be just as
slow for those people. They might just not realize it, because most of the time spent
loading that page shows you a blank tab. Now that the "versus" test has been
running on cloudbees for a couple of days, targeting
mediawiki.org, we can see that the
file page is slower on average:
http://multimedia-metrics.wmflabs.org/dashboards/mmv#media_viewer_vs_file_p…
That wasn't the case a couple of weeks back, but we've made a number of
improvements since.
According to those stats, MediaViewer with a warm JS cache beats the file page 2 to 1.
That's pretty impressive!
Are the stats reliable though? There is a huge jump a few days ago, even in the file page
loading times. Is that when it was switched over to Cloudbees?
_______________________________________________
Multimedia mailing list
Multimedia(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/multimedia
_______________________________
On Apr 20, 2014, at 3:39 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
Many images still take a much longer time to load in
practice, as reported by beta users around the world
Anecdotal evidence doesn't invalidate data collected directly by people's web
browsers. People's impressions aren't as reliable as the data we're measuring.
The reason we're collecting data this way is so that we can separate the facts
from the feelings people might have. Since we're talking about an average, there are
undeniably slower loads for certain people (soon to be shown as histograms), but I don't
see any reason to doubt the collected averages based on people's comments.
For every dozen people who felt the need to comment that it was slow for them, there could
have been hundreds or thousands who were satisfied and didn't say a thing. In my
experience, people who are happy or unaffected by something are a lot less likely to
engage with a feedback survey.
Can we really assume that the mean image load time in India is 691 milliseconds?
Yes, that data is very real. For the API map, India's figures are calculated over
12,209 measured requests from 5,158 unique IP addresses, none of which have bot-like user
agent strings.
But could there also be some bots or other traffic which could be distorting the
results?
Bots are a valid concern, so I did some digging. Some bots masquerade as real browsers
(unlike serious search engines such as Google and Yahoo, which make up most of the bot
traffic), but since we're not seeing any non-masquerading bots at all in the India data,
I seriously doubt there is any bot traffic at this time that would impact the results for
that country.
Looking at all countries, I only see 10 hits with a Googlebot user agent string, and with
such a low count it's hard to say whether it really is Googlebot (and not
someone or something pretending to be it). In fact, given the low bandwidth on those
particular hits (24kb/s on an image load that was a Varnish hit) and the fact that their
IPs appeared to come from Poland and Bangladesh, I doubt it was really Google.
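For what it's worth, the kind of user-agent filtering described above can be sketched in a
few lines of Python. The pattern list and the sample hits below are purely illustrative,
not the actual filter or data used for the dashboards:

```python
import re

# Hypothetical list of bot markers; the real filter may differ
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def is_bot(user_agent):
    """Return True if the user agent string looks bot-like."""
    return bool(BOT_PATTERN.search(user_agent or ""))

# Made-up (user_agent, load_time_ms) samples
hits = [
    ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36", 450),
    ("Mozilla/5.0 (compatible; Googlebot/2.1)", 120),
    ("Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/28.0", 700),
]

# Keep only timings from non-bot-like user agents
human_timings = [t for ua, t in hits if not is_bot(ua)]
```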
While it's undeniable that rural areas one might visit during travels still suffer
from low internet speeds, the majority of the world's population now lives in cities:
http://www.un.org/en/development/desa/population/publications/urbanization/…
The average broadband speed worldwide is probably much higher nowadays than most
people think (http://www.netindex.com/), and dial-up is rapidly disappearing
(http://www.pewinternet.org/data-trend/internet-use/connection-type/). Slow internet
speed is a reality for a lot of people, but not for the majority of people. I'm not
surprised by the average results we're seeing. I agree that this rapid change in recent
years can be counter-intuitive when you're used to traveling to rural locations.
Any practical recommendations for addressing this concern?
Can the users who've been complaining about speed be contacted? That would allow us
to verify whether the bad experience is consistent for them; we could measure it directly
and even compare it to their general internet speed.
As far as performance and stats improvements are concerned, we've been over it
several times and I think everything that could be done is already implemented, filed or
on its way.
And let's not forget that the status quo (opening the File: page) might be just as
slow for those people. They might just not realize it, because most of the time spent
loading that page shows you a blank tab. Now that the "versus" test has been
running on cloudbees for a couple of days, targeting
mediawiki.org, we can see that the
file page is slower on average:
http://multimedia-metrics.wmflabs.org/dashboards/mmv#media_viewer_vs_file_p…
That wasn't the case a couple of weeks back, but we've made a number of
improvements since.
That's why I think it's important to do some real measurements on users that
bring up this issue. If we're not already doing it, we should encourage them to
optionally enter their email address for the purpose of investigating issues further.
On Sat, Apr 19, 2014 at 9:44 PM, Fabrice Florin <fflorin(a)wikimedia.org> wrote:
Thanks to everyone for this great teamwork!
The updated geographical performance dashboards which Gilles and Mark just posted paint a
more optimistic picture than before, which is encouraging:
http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_p…
However, these extremely fast load times do not match what we are hearing from our users
— or even our own experience on slower connections. Many images still take a much longer
time to load in practice, as reported by beta users around the world, from Brazil to
Hungary.
Can we really assume that the mean image load time in India is 691 milliseconds? Seems
way too fast, based on my experience traveling in Asia a few weeks ago — where images
could take a very long time to load, if at all.
As Gergo pointed out, these early results may be because our first beta testers may have
some faster connections than average users. But could there also be some bots or other
traffic which could be distorting the results?
I know that we are working next on histograms that will give us a better sense of how
outliers are performing against average users. Can’t wait for that.
But I am still concerned that this chart may be painting a much rosier picture than
what’s actually going on in the real world.
Any practical recommendations for addressing this concern? We want to know what’s really
happening for average users, so we can determine whether or not regions with slow
connections like India should consider making this feature opt-in, rather than opt-out.
Thanks again to you all for helping us gain more clarity on this critical issue :)
Fabrice
On Apr 18, 2014, at 11:16 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
Mark deployed the change, the mean and standard
deviation on the "Overall network performance" and "Geographical network
performance" tabs are now geometric:
http://multimedia-metrics.wmflabs.org/dashboards/mmv
These charts and maps now make a lot more sense! Next I'll be working on distribution
histograms, so that we can see the outlier values that are now excluded from those
graphs.
Thanks again Aaron, thanks to you these visualizations have become truly useful and
meaningful, in the way they were meant to be.
On Thu, Apr 17, 2014 at 6:13 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
Yikes! Good catch.
On Thu, Apr 17, 2014 at 11:12 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
A solution to this problem is to generate a geometric mean[2] instead.
Thanks a lot for the help, it literally instantly solved my problem!
There was a small mistake in the order of functions in your example, for the record it
should be:
EXP(AVG(LOG(event_total))) AS geometric_mean
And conveniently the geometric standard deviation can be calculated the same way:
EXP(STDDEV(LOG(event_total))) AS geometric_stddev
I put it to the test on a specific set of data where we had a huge outlier, and for that
data it seems equivalent to excluding the bottom and top 10 percent of values, which is
exactly what I was after.
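As a quick sanity check outside MySQL, the corrected formulas can be reproduced in a few
lines of Python. The timing values below are made up for illustration; the point is that
exp(avg(log(x))) stays in a sensible range when a huge outlier wrecks the arithmetic mean:

```python
import math

def geometric_mean(values):
    # EXP(AVG(LOG(x))): average in log space, then transform back
    return math.exp(sum(math.log(v) for v in values) / len(values))

def geometric_stddev(values):
    # EXP(STDDEV(LOG(x))): population std dev in log space, exponentiated
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    variance = sum((l - mean_log) ** 2 for l in logs) / len(logs)
    return math.exp(math.sqrt(variance))

# Typical load times in ms, plus one huge outlier
timings = [120, 150, 180, 200, 250, 60000]
arithmetic = sum(timings) / len(timings)  # dragged up to 10150 ms by the outlier
geometric = geometric_mean(timings)       # stays in the hundreds
```

The outlier pulls the arithmetic mean up by two orders of magnitude, while the geometric
mean barely moves.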
On Wed, Apr 16, 2014 at 4:24 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
Hi Gilles,
I think I know just the thing you're looking for.
It turns out that much of this performance data is log-normally distributed[1].
Log-normal distributions tend to have a hockey-stick shape where most of the values are
close to zero, but occasionally very large values appear[3]. The mean of a
log-normal distribution tends to be sensitive to outliers like the ones you describe.
A solution to this problem is to generate a geometric mean[2] instead. One convenient
thing about log-normal data is that if you log() it, it becomes normal[4] -- and not
sensitive to outliers in the usual way. Also conveniently, geometric means are super easy
to generate. All you need to do is this: (1) pass all of the data through log(), (2) pass
the logged data through mean() (or avg() -- whatever), (3) pass the result through exp().
The best thing about this is that you can do it in MySQL.
For example:
SELECT
country,
mean(timings) AS regular_mean,
exp(log(mean(timings))) AS geometric_mean
FROM log.WhateverSchemaYouveGot
GROUP BY country
1.
https://en.wikipedia.org/wiki/Log-normal_distribution
2.
https://en.wikipedia.org/wiki/Geometric_mean
3. See distribution.log_normal.svg (24K)
4. See distribution.log_normal.logged.svg (33K)
-Aaron
On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
So, my latest idea for a solution is to write a python script that will import the
section (last X days) of data from the EventLogging tables that we're interested in
into a temporary sqlite database, then proceed with removing the upper and lower
percentiles of the data, according to any column grouping that might be necessary. And
finally, once the data preprocessing is done in sqlite, run similar queries as before to
export the mean, standard deviation, etc. for the given metrics to TSVs. I think using
sqlite is cleaner than doing the preprocessing on db1047 anyway.
It's quite an undertaking: it basically means rewriting all of our current SQL => TSV
conversion. The ability to use more steps in the conversion means that we'd be able to
have simpler, more readable SQL queries. It would also be a good opportunity to clean up
the giant performance query with a bazillion JOINs:
https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e…
which can actually be divided into several data sources all used in the same graph.
Does that sound like a good idea, or is there a simpler solution out there that someone
can think of?
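A minimal sketch of the proposed preprocessing step, using Python's built-in sqlite3
module. The table name, column names, and timing values are illustrative stand-ins for
the EventLogging data, not the real schema:

```python
import sqlite3
import statistics

# In-memory scratch database standing in for the temporary sqlite step
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timings (country TEXT, event_total REAL)")

# Hypothetical EventLogging-style rows; one extreme outlier
rows = [("IN", t) for t in [500, 600, 650, 700, 720, 750, 800, 850, 900, 60000]]
conn.executemany("INSERT INTO timings VALUES (?, ?)", rows)

def trimmed_stats(conn, country, trim=0.1):
    """Drop the bottom and top `trim` fraction per group, then aggregate."""
    values = [r[0] for r in conn.execute(
        "SELECT event_total FROM timings WHERE country = ? ORDER BY event_total",
        (country,))]
    k = int(len(values) * trim)
    trimmed = values[k:len(values) - k] if k else values
    return statistics.mean(trimmed), statistics.pstdev(trimmed)

mean, stddev = trimmed_stats(conn, "IN")  # the 60000ms outlier is excluded
```

The aggregates from `trimmed_stats` would then be written out to TSVs as in the current
pipeline.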
Well, I think this sounds like we need to seriously evaluate how people are using
EventLogging data and provide this sort of analysis as a feature. We'd have to hear
from more people but I bet it's the right thing to do long term.
Meanwhile, "simple" is highly subjective here. If it were me, I'd clean up
the indentation of that giant SQL query you have, then maybe figure out some ways to make
it faster, then be happy as a clam. So if SQLite is the tool you feel happy as a clam
with, then that sounds like a great solution. Alternatives would be Python, PHP, etc. I
forget whether pandas is allowed where you're working, but that's a great Python
library that would make what you're talking about fairly easy.
Another thing for us to seriously consider is PostgreSQL. This has proper f-ing
temporary tables and supports actual people doing actual work with databases. We could
dump data, especially really simple schemas like EventLogging, into PostgreSQL for
analysis.
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________
Fabrice Florin
Product Manager
Wikimedia Foundation
http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)