Re: [Analytics] [EE] Notifications Live on Spanish, Dutch, Hebrew and other Wikipedias

19 Sep 2013

      On Thu, Sep 19, 2013 at 7:25 AM, Andrew Otto otto@wikimedia.org wrote:
...
We have experimented with this in the past and it should not be too hard
to re-enable. I will confirm with Andrew.
There are two ways to do this, I think.

1. Use udp2log and pipe into Kafka.
2. Write a Kafka producer endpoint for EventLogging.

I like #2!  And I think Ori does too (we talked about this once before).
 It should be pretty easy to do.
My suggestion: let's start with #1, because we can do that right off the bat,
and once we have deployed Kafka we migrate to the Kafka producer endpoint. So
#1 is the intermediate solution and #2 the final solution.
D
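
For illustration, the second option (a Kafka producer endpoint for EventLogging) might be sketched roughly as below. This is only a sketch under assumptions, not EventLogging's actual code: the `send(topic, payload)` transport, the `eventlogging_<schema>` topic naming, and the sample event fields are all hypothetical, and no real Kafka client is imported.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical transport signature: send(topic, payload). A real deployment
# would wrap a Kafka client here; none is imported in this sketch.
Send = Callable[[str, bytes], None]


def topic_for(event: Dict) -> str:
    """Derive a topic name from the event's schema (the naming is an assumption)."""
    return "eventlogging_" + str(event["schema"])


def produce(event: Dict, send: Send) -> None:
    """Serialize one event as JSON and hand it to the transport."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    send(topic_for(event), payload)


if __name__ == "__main__":
    # Stand-in transport that just records what would have been sent.
    sent: List[Tuple[str, bytes]] = []
    sample = {"schema": "SampleSchema", "wiki": "wiki_a", "event": {"action": "read"}}
    produce(sample, lambda topic, payload: sent.append((topic, payload)))
    print(sent[0][0])
```

Keeping the transport behind a plain callable means the same `produce` function could sit behind udp2log first and a native Kafka client later, which matches the two-stage plan discussed above.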
...
On Sep 19, 2013, at 7:22 AM, Diederik van Liere dvanliere@wikimedia.org
wrote:
On Wed, Sep 18, 2013 at 8:05 PM, Dario Taraborelli <
dtaraborelli@wikimedia.org> wrote:
...
A little bit of context on these dashboards and what part of the process
is "manual".
*data sources*
The graphs primarily use data obtained by querying the EventLogging db or
the private SQL slaves (there are some exceptions like the revert graphs,
which involve more pre-processing). Refreshing the data typically depends
on scripts run hourly or daily via cronjobs on stat1. The datasets are then
rsync'ed to stat1001. The dashboards live on multiple Limn instances
(typically set up on labs and controlled by different teams) which host the
datasource, graph and dashboard definitions.

*graph customization*
It's no big deal to generate multiple dashboards in a scripted way
(that's what we do when a new feature is deployed on a number of projects).
What's tricky is the fact that different projects may have different
feature sets enabled, each feature may be configured differently on a
per-project basis, and in some cases different parameters (such as
project-specific cutoff dates) need to be set for segmenting the data.
It's obvious that this process doesn't scale well: pulling data from 800
slaves can be a pain (Oliver recently shared some really good thoughts on
this), and it's hard to keep track of what data exists for each project or
where it's hosted. Centralizing the generation of the datasets and the
corresponding graphs will enormously simplify the process of creating and
discovering dashboards, and I think we should start with the low-hanging
fruit of EventLogging data. EventLogging produces well-defined,
project-agnostic datasets that can be written natively into different
stores (including SQL, Redis, Hadoop or flat files). So here's what we
could do:

1. We start producing dashboards for core metrics for all projects by
ingesting EventLogging data into Hadoop.
We have experimented with this in the past and it should not be too hard
to re-enable. I will confirm with Andrew.
...

2. Next we experiment with importing data from core MediaWiki tables (that by
definition exist on each project) and have no problem of graph
customization/fine-tuning.
We also experimented with this; we have a tool called Sqoop to import the
data from MySQL to Hadoop. We would need to define which tables we need to
import but that's not hard. I wrote a small tool called sqoopy that will
automatically map MySQL column types to Hive column types, and now with
the labsdb's we can just import from those databases and not have to worry
about PII.
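
As a rough illustration of what mapping MySQL column types to Hive column types involves, here is a minimal sketch. It is not sqoopy's actual code: the mapping table, the `hive_type` function name, the fallback to STRING, and the choice to widen unsigned `int` to BIGINT are all assumptions made for the example.

```python
import re

# Illustrative mapping only; the table a real tool uses may differ.
MYSQL_TO_HIVE = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "mediumint": "INT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "char": "STRING",
    "varchar": "STRING",
    "text": "STRING",
    "varbinary": "STRING",  # assumption: binary strings are treated as text
    "blob": "BINARY",
    "timestamp": "TIMESTAMP",
}


def hive_type(mysql_type: str) -> str:
    """Map a MySQL column type such as 'int(10) unsigned' to a Hive type."""
    normalized = mysql_type.strip().lower()
    base = re.match(r"[a-z]+", normalized)
    if base is None:
        return "STRING"  # unknown input falls back to STRING
    name = base.group(0)
    # An unsigned 32-bit int can exceed Hive's signed INT range, so widen it.
    if name == "int" and "unsigned" in normalized:
        return "BIGINT"
    return MYSQL_TO_HIVE.get(name, "STRING")


if __name__ == "__main__":
    for column_type in ("int(10) unsigned", "varbinary(255)", "tinyint(1)"):
        print(column_type, "->", hive_type(column_type))
```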
...

3. Finally, we define a registry of what features are enabled on each
project and selectively import from the production DB tables that are
needed to generate the data and the corresponding parameters.
Would it make sense to read this info from the MediaWiki LocalSettings.php
file, or does that not contain all the relevant info?
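
To make the registry idea concrete, one possible shape is sketched below. Everything in it is hypothetical: the project names, feature names, table names, and cutoff dates are placeholders, and nothing here reflects how MediaWiki configuration is actually stored.

```python
from typing import Dict, List

# Hypothetical: which production tables each feature needs.
FEATURE_TABLES: Dict[str, List[str]] = {
    "feature_x": ["table_x"],
    "feature_y": ["table_y1", "table_y2"],
}

# Hypothetical: project -> enabled feature -> per-project parameters.
REGISTRY: Dict[str, Dict[str, Dict[str, str]]] = {
    "wiki_a": {"feature_x": {"cutoff_date": "2013-01-01"}},
    "wiki_b": {"feature_x": {"cutoff_date": "2013-02-01"}, "feature_y": {}},
}


def import_plan(project: str) -> List[str]:
    """Return the sorted table names to import for one project."""
    tables = set()
    for feature in REGISTRY.get(project, {}):
        tables.update(FEATURE_TABLES.get(feature, []))
    return sorted(tables)


if __name__ == "__main__":
    for project in sorted(REGISTRY):
        print(project, import_plan(project))
```

Whether such a registry is maintained by hand or derived from per-wiki configuration is exactly the open question raised above.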
...
Does this approach make sense and is there anything that prevents us from
experimenting with step 1?
Nothing technical prevents us as far as I am aware; it's getting it
appropriately prioritized so that we can work on it fast.
...
Dario
On Sep 18, 2013, at 7:13 PM, Diederik van Liere dvanliere@wikimedia.org
wrote:
On Wed, Sep 18, 2013 at 5:38 PM, Matthew Flaschen <
mflaschen@wikimedia.org> wrote:
...
On 09/18/2013 08:30 PM, Fabrice Florin wrote:
...
It's a lack of automated tools.
Right now, Dario has to create each of them manually and it's not
practical for him to support hundreds of sites, given his workload.
Yeah, it certainly doesn't make sense to do them all manually.  But I
think it would be great to be able to script this.
Someday, when we have more resources, our analytics team may be able to
...
automate this process, so we can support more sites.
Agreed, I'm CCing Analytics on this.  For feature requests like this, is
it best to file an enhancement in Bugzilla, email the Analytics list, or
something else?
What exactly is the feature request (automate what process)?
D
...
Matt Flaschen
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

EE mailing list
EE@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ee

