On 09/18/2013 08:30 PM, Fabrice Florin wrote:
It's a lack of automated tools.
Right now, Dario has to create each of them manually and it's not practical for him to support hundreds of sites, given his workload.
Yeah, it certainly doesn't make sense to do them all manually. But I think it would be great to be able to script this.
Someday, when we have more resources, our analytics team may be able to automate this process, so we can support more sites.
Agreed, I'm CCing Analytics on this. For feature requests like this, is it best to file an enhancement in Bugzilla, email the Analytics list, or something else?
Matt Flaschen
On Wed, Sep 18, 2013 at 5:38 PM, Matthew Flaschen <mflaschen@wikimedia.org> wrote:
On 09/18/2013 08:30 PM, Fabrice Florin wrote:
It's a lack of automated tools.
Right now, Dario has to create each of them manually and it's not practical for him to support hundreds of sites, given his workload.
Yeah, it certainly doesn't make sense to do them all manually. But I think it would be great to be able to script this.
Someday, when we have more resources, our analytics team may be able to automate this process, so we can support more sites.
Agreed, I'm CCing Analytics on this. For feature requests like this, is it best to file an enhancement in Bugzilla, email the Analytics list, or something else?
What exactly is the feature request (automate what process)? D
Matt Flaschen
A little bit of context on these dashboards and what part of the process is "manual".
*data sources* The graphs primarily use data obtained by querying the EventLogging db or the private SQL slaves (there are some exceptions like the revert graphs, which involve more pre-processing). Refreshing the data typically depends on scripts run hourly or daily via cronjobs on stat1. The datasets are then rsync'ed to stat1001. The dashboards live on multiple Limn instances (typically set up on labs and controlled by different teams) which host the datasource, graph and dashboard definitions.
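As a rough illustration, one of those refresh jobs boils down to a query plus a TSV append, something like the sketch below (host, credentials, table and file names are made up for the example):

    import csv
    import datetime

    import pymysql  # MySQL client library, assumed available on the stat box

    # All names below are made up for the example: a hypothetical
    # EventLogging table (SchemaName_revision) and a Limn datasource file.
    EL_TABLE = "Echo_12345678"
    DATAFILE = "/a/limn-public-data/echo/daily_events.tsv"

    conn = pymysql.connect(host="el-db.example", user="research",
                           password="********", database="log")
    try:
        day = datetime.date.today() - datetime.timedelta(days=1)
        day_prefix = day.strftime("%Y%m%d")  # MediaWiki-style timestamps
        with conn.cursor() as cur:
            # Count yesterday's events per wiki; Limn only needs a date
            # column plus one value column per series.
            cur.execute(
                "SELECT wiki, COUNT(*) FROM {t} "
                "WHERE LEFT(timestamp, 8) = %s GROUP BY wiki".format(t=EL_TABLE),
                (day_prefix,))
            rows = cur.fetchall()
    finally:
        conn.close()

    # Append one row per wiki to the TSV that gets rsync'ed to the web host.
    with open(DATAFILE, "a") as f:
        writer = csv.writer(f, delimiter="\t")
        for wiki, count in rows:
            writer.writerow([day.isoformat(), wiki, count])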
*graph customization* it's no big deal to generate multiple dashboards in a scripted way (that's what we do when a new feature is deployed on a number of projects). What's tricky is the fact that different projects may have different feature sets enabled, each feature may be configured differently on a per-project basis, and in some cases different parameters (such as project-specific cutoff dates) need to be set for segmenting the data.
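Concretely, the scripted part is not much more than the sketch below; the tricky bit is maintaining the per-project registry it reads from (registry contents and dashboard fields here are simplified placeholders, not the actual Limn schema):

    import json
    import os

    # Made-up per-project registry: which features are enabled and any
    # project-specific parameters (e.g. cutoff dates for segmenting data).
    REGISTRY = {
        "eswiki": {"features": ["echo", "thanks"], "cutoff": "2013-09-17"},
        "nlwiki": {"features": ["echo"],           "cutoff": "2013-09-17"},
        "hewiki": {"features": ["echo", "thanks"], "cutoff": "2013-09-17"},
    }

    OUT_DIR = "dashboards"  # wherever the Limn instance reads definitions from

    def graphs_for(wiki, conf):
        """Pick graph ids based on the features enabled for this wiki."""
        graphs = []
        if "echo" in conf["features"]:
            graphs.append("%s_daily_notifications" % wiki)
        if "thanks" in conf["features"]:
            graphs.append("%s_daily_thanks" % wiki)
        return graphs

    os.makedirs(OUT_DIR, exist_ok=True)
    for wiki, conf in REGISTRY.items():
        dashboard = {
            "id": "%s-features" % wiki,
            "headline": "%s feature metrics (since %s)" % (wiki, conf["cutoff"]),
            "tabs": [{"name": "Features", "graph_ids": graphs_for(wiki, conf)}],
        }
        with open(os.path.join(OUT_DIR, "%s-features.json" % wiki), "w") as f:
            json.dump(dashboard, f, indent=2)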
This process obviously doesn't scale well: pulling data from 800 slaves can be a pain (Oliver recently shared some really good thoughts on this) and it's hard to keep track of what data exists for each project or where it's hosted. Centralizing the generation of the datasets and the corresponding graphs will enormously simplify the process of creating and discovering dashboards, and I think we should start from the low-hanging fruit of EventLogging data. EventLogging produces well-defined, project-agnostic datasets that can be written natively into different stores (including SQL, Redis, Hadoop or flat files). So here's what we could do:
1. we start producing dashboards for core metrics for all projects by ingesting EventLogging data into Hadoop.
2. next we experiment importing data from core MediaWiki tables (that by definition exist on each project) and have no problem of graph customization/fine-tuning.
3. finally, we define a registry of what features are enabled on each project and selectively import from the production DB tables that are needed to generate the data and the corresponding parameters.
Does this approach make sense and is there anything that prevents us from experimenting with step 1?
Dario
On Wed, Sep 18, 2013 at 8:05 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
A little bit of context on these dashboards and what part of the process is "manual".
*data sources* The graphs primarily use data obtained by querying the EventLogging db or the private SQL slaves (there are some exceptions like the revert graphs, which involve more pre-processing). Refreshing the data typically depends on scripts run hourly or daily via cronjobs on stat1. The datasets are then rsync'ed to stat1001. The dashboards live on multiple Limn instances (typically set up on labs and controlled by different teams) which host the datasource, graph and dashboard definitions.
*graph customization* it's no big deal to generate multiple dashboards in a scripted way (that's what we do when a new feature is deployed on a number of projects). What's tricky is the fact that different projects may have different feature sets enabled, each feature may be configured differently on a per-project basis, and in some cases different parameters (such as project-specific cutoff dates) need to be set for segmenting the data.
This process obviously doesn't scale well: pulling data from 800 slaves can be a pain (Oliver recently shared some really good thoughts on this) and it's hard to keep track of what data exists for each project or where it's hosted. Centralizing the generation of the datasets and the corresponding graphs will enormously simplify the process of creating and discovering dashboards, and I think we should start from the low-hanging fruit of EventLogging data. EventLogging produces well-defined, project-agnostic datasets that can be written natively into different stores (including SQL, Redis, Hadoop or flat files). So here's what we could do:
- we start producing dashboards for core metrics for all projects by ingesting EventLogging data into Hadoop.
We have experimented with this in the past and it should not be too hard to re-enable. I will confirm with Andrew.
- next we experiment importing data from core MediaWiki tables (that by definition exist on each project) and have no problem of graph customization/fine-tuning.
We also experimented with this; we have a tool called Sqoop to import the data from MySQL to Hadoop. We would need to define which tables we need to import, but that's not hard. I wrote a small tool called sqoopy that will automatically map MySQL column types to Hive column types, and now with the labsdbs we can just import from those databases and not have to worry about PII.
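The interesting part of that mapping is roughly the following (a simplified sketch in the spirit of sqoopy, not its actual code):

    # Simplified sketch of mapping MySQL column types to Hive column types,
    # in the spirit of what sqoopy does (not its actual implementation).
    MYSQL_TO_HIVE = {
        "tinyint": "TINYINT", "smallint": "SMALLINT", "int": "INT",
        "bigint": "BIGINT", "float": "FLOAT", "double": "DOUBLE",
        "varchar": "STRING", "char": "STRING", "text": "STRING",
        "varbinary": "STRING", "blob": "STRING",
        "datetime": "STRING", "timestamp": "STRING",
    }

    def hive_type(mysql_type):
        """Map e.g. 'int(10) unsigned' or 'varbinary(14)' to a Hive type."""
        base = mysql_type.split("(")[0].split()[0].lower()
        return MYSQL_TO_HIVE.get(base, "STRING")

    def hive_ddl(table, columns):
        """columns: (name, mysql_type) pairs, e.g. from a DESCRIBE query."""
        cols = ",\n  ".join("`%s` %s" % (name, hive_type(t))
                            for name, t in columns)
        return "CREATE TABLE `%s` (\n  %s\n)" % (table, cols)

    print(hive_ddl("revision", [("rev_id", "int(10) unsigned"),
                                ("rev_timestamp", "varbinary(14)")]))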
- finally, we define a registry of what features are enabled on each project and selectively import from the production DB tables that are needed to generate the data and the corresponding parameters.
Would it make sense to read this info from the MediaWiki LocalSettings.php file, or does that not contain all the relevant info?
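If the settings file does contain what we need, a very rough way to bootstrap such a registry would be to scan it for feature flags, along these lines (the path and variable names are illustrative only):

    import re

    # Very rough sketch: scan a MediaWiki settings file for boolean feature
    # flags ($wgFoo = true/false;) to seed a per-wiki feature registry.
    FLAG_RE = re.compile(r'\$(wg\w+)\s*=\s*(true|false)\s*;', re.IGNORECASE)

    def read_feature_flags(settings_path):
        flags = {}
        with open(settings_path) as f:
            for match in FLAG_RE.finditer(f.read()):
                flags[match.group(1)] = match.group(2).lower() == "true"
        return flags

    # e.g. {'wgSomeFeature': True, ...}, which tells us which tables and
    # graph parameters to import for that wiki.
    # print(read_feature_flags("/srv/mediawiki/LocalSettings.php"))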
Does this approach make sense and is there anything that prevents us from experimenting with step 1?
Nothing technical prevents us, as far as I am aware; it's a matter of getting it appropriately prioritized so that we can work on it fast.
Dario
We have experimented with this in the past and it should not be too hard to re-enable. I will confirm with Andrew.
There are two ways to do this, I think.
1. Use udp2log and pipe into Kafka.
2. Write a Kafka producer endpoint for EventLogging.
I like #2! And I think Ori does too (we talked about this once before). It should be pretty easy to do.
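For example, #2 could start out as a small bridge that subscribes to EventLogging's ZeroMQ publisher and produces each event to Kafka; a sketch only (endpoint, topic and broker names below are placeholders, not real production values):

    import json

    import zmq                       # pyzmq
    from kafka import KafkaProducer  # kafka-python

    # Placeholders, not real production values.
    ZMQ_ENDPOINT = "tcp://eventlogging-host.example:8600"
    KAFKA_BROKERS = ["kafka1.example:9092"]
    TOPIC = "eventlogging_all"

    producer = KafkaProducer(bootstrap_servers=KAFKA_BROKERS)

    # Subscribe to the EventLogging publisher stream (one JSON event per message).
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect(ZMQ_ENDPOINT)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")

    while True:
        raw = sub.recv_string()
        event = json.loads(raw)  # sanity-check that the payload is valid JSON
        # Key by schema name so downstream consumers can filter per schema.
        producer.send(TOPIC,
                      key=event.get("schema", "unknown").encode("utf-8"),
                      value=raw.encode("utf-8"))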
On Thu, Sep 19, 2013 at 7:25 AM, Andrew Otto otto@wikimedia.org wrote:
We have experimented with this in the past and it should not be too hard to re-enable. I will confirm with Andrew.
There are two ways to do this, I think.
1. Use udp2log and pipe into Kafka.
2. Write a Kafka producer endpoint for EventLogging.
I like #2! And I think Ori does too (we talked about this once before). It should be pretty easy to do.
My suggestion: let's start with 1 because we can do that off the bat, and once we have deployed Kafka we migrate it to Kafka. So 1) as the intermediate solution, 2) as the final solution. D
We also experimented with this; we have a tool called Sqoop to import the data from MySQL to Hadoop. We would need to define which tables we need to import, but that's not hard. I wrote a small tool called sqoopy that will automatically map MySQL column types to Hive column types, and now with the labsdbs we can just import from those databases and not have to worry about PII.
Sounds like a good plan. I suspect we will need to import data from an uncensored source, though: Tool Labs removes much more than PII (for example, the archive table doesn't contain any PII and it's routinely used for internal data analysis).
Dario
On Thu, Sep 19, 2013 at 7:48 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
We also experimented with this; we have a tool called Sqoop to import the data from MySQL to Hadoop. We would need to define which tables we need to import, but that's not hard. I wrote a small tool called sqoopy that will automatically map MySQL column types to Hive column types, and now with the labsdbs we can just import from those databases and not have to worry about PII.
Sounds like a good plan. I suspect we will need to import data from an uncensored source, though: Tool Labs removes much more than PII (for example, the archive table doesn't contain any PII and it's routinely used for internal data analysis).
Ok, that's fine as well -- we would have to be a bit more careful about 'hammering' those prod slaves, but that's all. An important question is how often we would need to import the data: daily, weekly, etc.
Dario
I'd like to focus on Sqooping in MySQL databases as the solution for ingesting EventLogging data into Hadoop, since it also applies to the production slaves. I think a cadence of daily updates is fine initially.
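For reference, the daily job could be as simple as looping Sqoop over the tables we care about; a sketch (JDBC URL, credentials, table list and mapper count are placeholders):

    import subprocess

    # Sketch of a daily driver that Sqoops a fixed list of tables into Hive.
    # The JDBC URL, credentials, table list and mapper count are placeholders.
    JDBC_URL = "jdbc:mysql://analytics-slave.example:3306/enwiki"
    TABLES = ["page", "revision", "logging"]

    for table in TABLES:
        subprocess.check_call([
            "sqoop", "import",
            "--connect", JDBC_URL,
            "--username", "research",
            "--password-file", "/user/hdfs/sqoop.password",
            "--table", table,
            "--hive-import",
            "--hive-table", "enwiki_%s" % table,
            "--num-mappers", "4",
        ])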
I'd also like to remove any production dependencies on Kafka until we've had a chance to run it in production with the mobile pageview data for a while. We're making good progress here but I don't want to have to debug in production.
-Toby
On 09/19/2013 12:17 PM, Toby Negrin wrote:
I'd like to focus on Sqooping in MySQL databases as the solution for ingesting EventLogging data into Hadoop, since it also applies to the production slaves. I think a cadence of daily updates is fine initially.
The EventLogging data is trimmed from MySQL after a while (or will be), so this needs to be done frequently enough that data is not lost, which may vary by EL table.
Matt Flaschen
On Wed, Sep 18, 2013 at 7:13 PM, Diederik van Liere <dvanliere@wikimedia.org> wrote:
On Wed, Sep 18, 2013 at 5:38 PM, Matthew Flaschen <mflaschen@wikimedia.org> wrote:
Agreed, I'm CCing Analytics on this. For feature requests like this, is it best to file an enhancement in Bugzilla, email the Analytics list, or something else?
What exactly is the feature request (automate what process)? D
Diederik and analytics: The part that was trimmed out of the original CC can be found either at http://lists.wikimedia.org/pipermail/ee/2013-September/000694.html or, if my copy-paste works (indent format-wise), below.
On Sep 18, 2013, at 5:24 PM, Matthew Flaschen wrote:
On 09/18/2013 08:10 PM, Fabrice Florin wrote:
Thanks, Gayle!
FYI, here are the metrics dashboards for the 3 largest sites in yesterday's release, which Dario was kind enough to create for us:
http://ee-dashboard.wmflabs.org/dashboards/eswiki-features
http://ee-dashboard.wmflabs.org/dashboards/nlwiki-features
http://ee-dashboard.wmflabs.org/dashboards/hewiki-features
We won't be able to create dashboards for all of the Echo sites, but we will aim to track metrics for the largest projects, for comparison purposes. Stay tuned for more ...
Why can't they be created for all sites? Is it performance, or is the dashboard creation process not sufficiently automated?
Matt Flaschen
On 09/18/2013 10:13 PM, Diederik van Liere wrote:
Agreed, I'm CCing Analytics on this. For feature requests like this, is it best to file an enhancement in Bugzilla, email the Analytics list, or something else?
What exactly is the feature request (automate what process)?
Sorry, I left out some of the context. Automating the creation of Limn dashboards like http://ee-dashboard.wmflabs.org/dashboards/eswiki-features , http://ee-dashboard.wmflabs.org/dashboards/nlwiki-features , and http://ee-dashboard.wmflabs.org/dashboards/hewiki-features . I believe they are generated from EventLogging data (on Echo usage), and the only variant is which wiki it is.
Matt Flaschen
On Wed, Sep 18, 2013 at 8:22 PM, Matthew Flaschen <mflaschen@wikimedia.org> wrote:
On 09/18/2013 10:13 PM, Diederik van Liere wrote:
Agreed, I'm CCing Analytics on this. For feature requests like this, is it best to file an enhancement in Bugzilla, email the Analytics list, or something else?
What exactly is the feature request (automate what process)?
Sorry, I left out some of the context. Automating the creation of Limn dashboards like http://ee-dashboard.wmflabs.org/dashboards/eswiki-features, http://ee-dashboard.wmflabs.org/dashboards/nlwiki-features, and http://ee-dashboard.wmflabs.org/dashboards/hewiki-features. I believe they are generated from EventLogging data (on Echo usage), and the only variant is which wiki it is.
Seems a bit more complicated based on Dario's description. But I roughly like the steps that Dario's outlined. Diederik is working on the battle plan and we'll make sure it includes these steps somehow. We should try to stay organized and think about the schema that we're building in Hadoop as we import more and more data. This can get out of control very quickly.
On 09/19/2013 10:04 AM, Dan Andreescu wrote:
Seems a bit more complicated based on Dario's description.
Well, some of the graphs vary by wiki. Some do not (e.g. "Daily notifications" should work the same on any wiki with Echo).
But I agree Dario's plan makes sense.
Matt Flaschen