Yikes! Good catch.
On Thu, Apr 17, 2014 at 11:12 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
A solution to this problem is to generate a geometric mean[2] instead.
Thanks a lot for the help; it instantly solved my problem!
There was a small mistake in the order of functions in your example; for
the record, it should be:
EXP(AVG(LOG(event_total))) AS geometric_mean
And conveniently the geometric standard deviation can be calculated the same way:
EXP(STDDEV(LOG(event_total))) AS geometric_stddev
I put it to the test on a specific set of data where we had a huge outlier,
and for that data it seems equivalent to excluding the bottom and top 10% of
values, which is exactly what I was after.
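For reference, a minimal sketch of a full query using both corrected
expressions, borrowing the hypothetical table and grouping column from
Aaron's example below (and assuming event_total is strictly positive, since
LOG() is only defined for positive values):

SELECT
    country,
    AVG(event_total) AS regular_mean,
    EXP(AVG(LOG(event_total))) AS geometric_mean,
    EXP(STDDEV(LOG(event_total))) AS geometric_stddev
FROM log.WhateverSchemaYouveGot
GROUP BY country;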
On Wed, Apr 16, 2014 at 4:24 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
> Hi Gilles,
> I think I know just the thing you're looking for.
> It turns out that much of this performance data is log-normally
> distributed[1]. Log-normal distributions tend to have a hockey-stick
> shape where most of the values are close to zero, but occasionally very
> large values appear[3]. The mean of a log-normal distribution tends to
> be sensitive to outliers like the ones you describe.
> A solution to this problem is to generate a geometric mean[2] instead.
> One convenient thing about log-normal data is that if you log() it, it
> becomes normal[4] -- and not sensitive to outliers in the usual way. Also
> convenient, geometric means are super easy to generate. All you need to do
> is this: (1) pass all of the data through log(), (2) pass the logged data
> through mean() (or avg() -- whatever), (3) pass the result through exp().
> The best thing about this is that you can do it in MySQL.
> For example:
> SELECT
> country,
> mean(timings) AS regular_mean,
> exp(log(mean(timings))) AS geometric_mean
> FROM log.WhateverSchemaYouveGot
> GROUP BY country
> 1. https://en.wikipedia.org/wiki/Log-normal_distribution
> 2. https://en.wikipedia.org/wiki/Geometric_mean
> 3. See attached distribution.log_normal.svg (24K)
> 4. See attached distribution.log_normal.logged.svg (33K)
> -Aaron
> On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
>>>> So, my latest idea for a solution is to write a python script that will
>>>> import the section (last X days) of data from the EventLogging tables
>>>> that we're interested in into a temporary sqlite database, then proceed
>>>> with removing the upper and lower percentiles of the data, according to
>>>> any column grouping that might be necessary. And finally, once the data
>>>> preprocessing is done in sqlite, run similar queries as before to export
>>>> the mean, standard deviation, etc. for given metrics to TSVs. I think
>>>> using sqlite is cleaner than doing the preprocessing on db1047 anyway.
>>>
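A rough sketch of that percentile-trimming step in SQLite, assuming a SQLite
build with window functions (3.25 or later) and a hypothetical events table
with country and event_total columns; NTILE(10) ranks each group into deciles
so the bottom and top ones can be dropped before aggregating:

SELECT
    country,
    AVG(event_total) AS trimmed_mean
FROM (
    SELECT
        country,
        event_total,
        NTILE(10) OVER (PARTITION BY country ORDER BY event_total) AS decile
    FROM events
) ranked
WHERE decile BETWEEN 2 AND 9   -- exclude the lowest and highest deciles
GROUP BY country;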
>>>> It's quite an undertaking; it basically means rewriting all our current
>>>> SQL => TSV conversion. The ability to use more steps in the conversion
>>>> means that we'd be able to have simpler, more readable SQL queries. It
>>>> would also be a good opportunity to clean up the giant performance query
>>>> with a bazillion JOINs
>>>> (https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e…),
>>>> which can actually be divided into several data sources all used in the
>>>> same graph.
>>>
>>>> Does that sound like a good idea, or is there a simpler solution out
>>>> there that someone can think of?
>>>
>>
>> Well, I think this sounds like we need to seriously evaluate how people
>> are using EventLogging data and provide this sort of analysis as a feature.
>> We'd have to hear from more people but I bet it's the right thing to do
>> long term.
>
>> Meanwhile, "simple" is highly subjective here. If it were me, I'd clean
>> up the indentation of that giant SQL query you have, then maybe figure out
>> some ways to make it faster, then be happy as a clam. So if SQLite is
>> the tool you feel happy as a clam with, then that sounds like a great
>> solution. Alternatives would be Python, PHP, etc. I forget if pandas is
>> allowed where you're working, but that's a great Python library that
>> would make what you're talking about fairly easy.
>
>> Another thing for us to seriously consider is PostgreSQL. This has
>> proper f-ing temporary tables and supports actual people doing actual work
>> with databases. We could dump data, especially really simple schemas like
>> EventLogging, into PostgreSQL for analysis.
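A minimal sketch of that approach, assuming the raw events have already been
loaded into a PostgreSQL table (the eventlogging_raw name and its columns
here are hypothetical):

-- Snapshot the rows of interest into a session-local temporary table.
CREATE TEMPORARY TABLE nav_timing AS
    SELECT country, event_total
    FROM eventlogging_raw;

-- Index and query the copy freely; it is dropped when the session ends.
CREATE INDEX ON nav_timing (country);

SELECT country, AVG(event_total) AS mean_total
FROM nav_timing
GROUP BY country;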
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics