Re: [EE] [Analytics] Notifications Live on Spanish, Dutch, Hebrew and other Wikipedias

19 Sep 2013

Here is a little context on these dashboards and on which part of the process is
"manual".

Data sources
The graphs primarily use data obtained by querying the EventLogging database or the
private SQL slaves (with some exceptions, such as the revert graphs, which involve more
pre-processing). The data is typically refreshed by scripts that run hourly or daily via
cron jobs on stat1. The datasets are then rsync'ed to stat1001. The dashboards live on
multiple Limn instances (typically set up on labs and controlled by different teams), which
host the datasource, graph and dashboard definitions.

Graph customization
It's no big deal to generate multiple dashboards in a scripted way (that's what we
do when a new feature is deployed on a number of projects). The tricky part is that
different projects may have different feature sets enabled, each feature may be
configured differently on a per-project basis, and in some cases different parameters
(such as project-specific cutoff dates) need to be set for segmenting the data.
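
To make the scripted part concrete, here is a minimal sketch of generating one dashboard
definition per project from shared defaults plus per-project overrides. Everything in it
is an assumption made for illustration: the field names do not follow Limn's actual
definition format, and the feature names and cutoff dates are made up.

```python
import json

# Hypothetical defaults shared by every project. The field names are
# illustrative and do not follow Limn's actual definition format.
DEFAULTS = {"features": ["notifications"], "cutoff_date": "2013-01-01"}

# Hypothetical per-project overrides: only the differences are listed.
OVERRIDES = {
    "eswiki": {"cutoff_date": "2013-09-17"},
    "hewiki": {"features": ["notifications", "thanks"]},
}


def dashboard_definition(project):
    """Merge the shared defaults with one project's overrides."""
    config = {**DEFAULTS, **OVERRIDES.get(project, {})}
    return {
        "id": f"{project}-dashboard",
        "project": project,
        "cutoff_date": config["cutoff_date"],
        # One graph per enabled feature.
        "graphs": [f"{project}-{feature}" for feature in config["features"]],
    }


def build_all(projects):
    """Return a mapping of project name to dashboard definition."""
    return {project: dashboard_definition(project) for project in projects}


if __name__ == "__main__":
    print(json.dumps(build_all(["eswiki", "nlwiki", "hewiki"]), indent=2))
```

The loop itself is trivial; the cost is in keeping something like OVERRIDES accurate for
every project, which is the part that is maintained by hand today.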

It's obvious that this process doesn't scale well: pulling data from 800 slaves
can be a pain (Oliver recently shared some really good thoughts on this), and it's hard
to keep track of what data exists for each project or where it's hosted. Centralizing
the generation of the datasets and the corresponding graphs would enormously simplify the
process of creating and discovering dashboards. I think we should start from the
low-hanging fruit of EventLogging data. EventLogging produces well-defined,
project-agnostic datasets that can be written natively into different stores (including
SQL, Redis, Hadoop or flat files). So here's what we could do:

1. We start producing dashboards for core metrics for all projects by ingesting
EventLogging data into Hadoop.
2. Next, we experiment with importing data from core MediaWiki tables, which by definition
exist on each project and so raise no problem of graph customization/fine-tuning.
3. Finally, we define a registry of which features are enabled on each project and
selectively import the production DB tables needed to generate the data, along with
the corresponding parameters.
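
As a rough illustration of what the registry in step 3 could look like, here is a minimal
sketch. The feature names, table names and parameters are placeholders, not real
MediaWiki tables or deployed configuration; the point is only that the list of tables to
import can be derived from the registry instead of being tracked by hand.

```python
# Hypothetical mapping from feature to the production tables it needs.
# All names are placeholders, not real MediaWiki tables.
FEATURE_TABLES = {
    "feature_a": {"table_a"},
    "feature_b": {"table_b", "table_c"},
}

# Hypothetical registry: for each project, the enabled features and the
# parameters (such as a cutoff date) used to segment that feature's data.
REGISTRY = {
    "eswiki": {"feature_a": {"cutoff_date": "2013-09-01"}},
    "nlwiki": {
        "feature_a": {},
        "feature_b": {"cutoff_date": "2013-08-01"},
    },
}


def import_plan(project):
    """Return the tables to import and the per-feature parameters for a project."""
    features = REGISTRY.get(project, {})
    tables = set()
    for feature in features:
        tables |= FEATURE_TABLES[feature]
    return {"tables": sorted(tables), "parameters": features}


if __name__ == "__main__":
    for name in sorted(REGISTRY):
        print(name, import_plan(name))
```

A project that is missing from the registry gets an empty plan, so adding a project or
enabling a feature means editing only the registry.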

Does this approach make sense and is there anything that prevents us from experimenting
with step 1?

Dario

On Sep 18, 2013, at 7:13 PM, Diederik van Liere <dvanliere(a)wikimedia.org> wrote:

...
  On Wed, Sep 18, 2013 at 5:38 PM, Matthew Flaschen <mflaschen(a)wikimedia.org> wrote:
 On 09/18/2013 08:30 PM, Fabrice Florin wrote:
 It's a lack of automated tools.

 Right now, Dario has to create each of them manually and it's not
 practical for him to support hundreds of sites, given his workload.

 Yeah, it certainly doesn't make sense to do them all manually.  But I think it would
be great to be able to script this.

 Someday, when we have more resources, our analytics team may be able to
 automate this process, so we can support more sites.

 Agreed, I'm CCing Analytics on this.  For feature requests like this, is it best to
file an enhancement in Bugzilla, email the Analytics list, or something else?
 What exactly is the feature request (automate what process)?
 D 

 Matt Flaschen

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

 _______________________________________________
 EE mailing list
 EE(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/ee 
