Including the analytics team in case they have a magical solution to our problem.
Currently, our graphs display the mean and standard deviation of metrics, as provided by the "mean" and "std" columns of our TSVs, which are generated from EventLogging data:
http://multimedia-metrics.wmflabs.org/dashboards/mmv

However, we already see that extreme outliers can inflate both the mean and the standard deviation enough to make the graphs useless for some metrics. France is a good example: a single massive value skewed the map into making the country look problematic:
http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab

There's no performance issue with France, but the graph suggests there is because of that one outlier.
Ideally, instead of the mean, our graphs would use the "trimmed mean", i.e. the mean of all values excluding the upper and lower X percentiles. Unfortunately, MariaDB doesn't provide that as a built-in function, and calculating it in SQL can be surprisingly complicated, especially since we often have to group values for a given column. The best alternative I've come up with so far for our geographical queries is to exclude values that differ from the mean by more than X times the standard deviation, which does flatten the mean somewhat. It's not ideal, though: in the context of our graphs I suspect it makes things look like they perform better than they really do. A sketch of that approach follows.
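A minimal sketch of the standard-deviation filter, assuming a hypothetical event_performance(country, duration_ms) table and a cutoff of 2 standard deviations (the real tables and cutoff may differ):

```sql
-- Hypothetical table: event_performance(country, duration_ms).
-- First derive per-country mean and standard deviation, then
-- re-aggregate after excluding values more than 2 standard
-- deviations away from the per-country mean.
SELECT e.country,
       AVG(e.duration_ms) AS filtered_mean,
       STD(e.duration_ms) AS filtered_std
FROM event_performance e
JOIN (
    SELECT country,
           AVG(duration_ms) AS mean_d,
           STD(duration_ms) AS std_d
    FROM event_performance
    GROUP BY country
) stats ON stats.country = e.country
WHERE ABS(e.duration_ms - stats.mean_d) <= 2 * stats.std_d
GROUP BY e.country;
```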
I think the main issue at the moment is that we're using a shell script to pipe a SQL query directly from db1047 to a TSV file. That limits us to one giant SQL query, and since the research_prod user can't create temporary tables on the log database, we can't preprocess the data across multiple queries to filter out the upper and lower percentiles. The trimmed mean would be feasible as a single complicated query if it weren't for the GROUP BY: each country needs its own percentile cutoffs, not global ones.
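For what it's worth, on a MariaDB new enough to have window functions (10.2+), which I don't believe db1047 is, the per-group trimming would be straightforward. The sketch below uses the same hypothetical table and trims the top and bottom 5%:

```sql
-- Requires MariaDB 10.2+ window functions; table and column
-- names are hypothetical. PERCENT_RANK() ranks each value
-- within its own country, so each group gets its own cutoffs.
SELECT country,
       AVG(duration_ms) AS trimmed_mean
FROM (
    SELECT country,
           duration_ms,
           PERCENT_RANK() OVER (
               PARTITION BY country
               ORDER BY duration_ms
           ) AS pct
    FROM event_performance
) ranked
WHERE pct BETWEEN 0.05 AND 0.95  -- drop bottom and top 5%
GROUP BY country;
```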