Sending it to the right Analytics email address....
On Wed, Apr 16, 2014 at 1:22 PM, Gilles Dubuc <gilles@wikimedia.org> wrote:
Including the analytics team in case they have a magical solution to our problem.
Currently, our graphs display the mean and standard deviation of metrics, as provided in "mean" and "std" columns coming from our TSVs, generated based on EventLogging data: http://multimedia-metrics.wmflabs.org/dashboards/mmv

However, we already see that extreme outliers can make the standard deviation and mean skyrocket and, as a result, make the graphs useless for some metrics. See France, for example, where a single massive value was able to skew the map into making the country look problematic: http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_p… There is no performance issue with France, but the graph suggests that there is because of that one outlier.
Ideally, instead of using the mean for our graphs, we would be using what is called the "trimmed mean", i.e. the mean of all values excluding the upper and lower X percent of values. Unfortunately, MariaDB doesn't provide that as a function, and calculating it with SQL can be surprisingly complicated, especially since we often have to group values for a given column. The best alternative I could come up with so far for our geographical queries was to exclude values that differ by more than X times the standard deviation from the mean. It kind of flattens the mean, but it's not ideal, because I think that in the context of our graphs it makes things look like they perform better than they really do.
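To make the difference concrete, here's a quick Python illustration with made-up numbers (not our actual data), comparing the plain mean/std, the trimmed mean, and the standard-deviation filter:

from statistics import mean, stdev

# Toy numbers mimicking one huge outlier (e.g. a single very slow
# measurement from France) among otherwise normal values.
samples = [120, 130, 125, 140, 135, 128, 132, 138, 127, 900000]

def trimmed_mean(values, percent=5):
    # mean after dropping the lowest and highest `percent`% of values
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)
    return mean(ordered[k:len(ordered) - k] if k else ordered)

def stddev_filtered_mean(values, factor=2):
    # mean of the values within `factor` standard deviations of the mean
    # (the workaround currently used for the geographical queries)
    m, s = mean(values), stdev(values)
    return mean([v for v in values if abs(v - m) <= factor * s])

print(mean(samples))                  # ~90000, dominated by the outlier
print(stdev(samples))                 # ~285000, equally useless
print(trimmed_mean(samples, 10))      # ~132, close to the typical value
print(stddev_filtered_mean(samples))  # ~131 here, but data-dependent

The exact numbers obviously depend on the cutoffs chosen; the point is just that the plain mean and std are dominated by the single outlier, while the trimmed mean isn't.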
I think the main issue at the moment is that we're using a shell script to pipe a SQL request directly from db1047 to a TSV file. That limits us to one giant SQL query, and since we don't have the ability to create temporary tables on the log database with the research_prod user, we can't preprocess the data in multiple queries to filter out the upper and lower percentiles. The trimmed mean would be kind of feasible as a single complicated query if it wasn't for the GROUP BY: http://stackoverflow.com/a/8909568
So, my latest idea for a solution is to write a Python script that will import the slice of data we're interested in (last X days) from the EventLogging tables into a temporary SQLite database, then proceed with removing the upper and lower percentiles of the data, according to any column grouping that might be necessary. And finally, once the data preprocessing is done in SQLite, run similar queries as before to export the mean, standard deviation, etc. for given metrics to TSVs. I think using SQLite is cleaner than doing the preprocessing on db1047 anyway.
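To give an idea of what I mean, here's a rough sketch of that preprocessing step in Python (all table, column and file names below are made up, the actual import from db1047 is left out, and the trimming is done in Python rather than SQL to keep it simple):

import sqlite3
from statistics import mean, pstdev  # pstdev matches MySQL's STD() (population std)

TRIM_PERCENT = 5  # drop the lowest and highest 5% of values per group

db = sqlite3.connect(":memory:")  # or a temporary file
db.execute("CREATE TABLE samples (country TEXT, duration REAL)")

# In the real script this would be filled from a SELECT against the
# EventLogging tables on db1047 (last X days); faked here.
rows = [("FR", 100 + i % 40) for i in range(100)] + [("FR", 900000)]
rows += [("DE", 140 + i % 20) for i in range(50)]
db.executemany("INSERT INTO samples VALUES (?, ?)", rows)

with open("country_network_performance.tsv", "w") as out:
    out.write("country\tmean\tstd\tn\n")
    countries = [row[0] for row in
                 db.execute("SELECT DISTINCT country FROM samples")]
    for country in countries:
        values = [row[0] for row in
                  db.execute("SELECT duration FROM samples"
                             " WHERE country = ? ORDER BY duration", (country,))]
        k = int(len(values) * TRIM_PERCENT / 100)
        trimmed = values[k:len(values) - k] if k else values
        out.write("%s\t%.1f\t%.1f\t%d\n" %
                  (country, mean(trimmed), pstdev(trimmed), len(trimmed)))

The trimming could also stay in SQL once the data is in SQLite, but doing it in Python keeps the per-group logic readable and avoids the single-giant-query problem entirely.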
It's quite an undertaking; it basically means rewriting all of our current SQL => TSV conversion. The ability to use more steps in the conversion means that we'd be able to have simpler, more readable SQL queries. It would also be a good opportunity to clean up the giant performance query with a bazillion JOINs: https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e… which can actually be divided into several data sources all used in the same graph.
Does that sound like a good idea, or is there a simpler solution out there that someone can think of?