OK, so a quick follow-up on this.  Jon and I worked on it today and identified two short-term problems:

1. http://mobile-reportcard.wmflabs.org/graphs/edits-monthly-5plus-editors no longer updates because the query used for it takes too long to finish
2. the scripts run hourly even for graphs that only need to be updated daily

For 1, I fiddled with the SQL until it performed a little better.  The old query was also not correct: I believe it was getting "the number of people who created an account in month X and made >= 5 edits anytime".  I changed it to what I assumed we wanted, which is "the number of people who created an account and made >= 5 edits in month X".  The new query (https://gist.github.com/milimetric/7554108) takes 4 minutes to run over 3 months' worth of data.

Juliusz, any idea what the timeout is on that job?  I'm running the query now for 13 months, and if it finishes under the timeout, we can just deploy it.  Otherwise, we could run one month at a time and concatenate the results.  Let me know what you think and I'll make a Change.
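
To make the "one month at a time" fallback concrete, here's a rough sketch in Python against an in-memory SQLite database.  Everything in it is an assumption for illustration: the `edits(user_id, ts)` table, the function names, and the sample rows are made up, and the per-month SQL is a simplified stand-in, not the query in the gist.

```python
import sqlite3

# Simplified per-month query (hypothetical schema): users with >= 5 edits
# whose timestamps fall in [start, end).
PER_MONTH_SQL = """
SELECT COUNT(*) FROM (
    SELECT user_id FROM edits
    WHERE ts >= ? AND ts < ?
    GROUP BY user_id
    HAVING COUNT(*) >= 5
)
"""

def month_ranges(first, count):
    """Yield (start, end) ISO dates for `count` consecutive months."""
    year, month = first
    for _ in range(count):
        nxt = (year + 1, 1) if month == 12 else (year, month + 1)
        yield f"{year:04d}-{month:02d}-01", f"{nxt[0]:04d}-{nxt[1]:02d}-01"
        year, month = nxt

def five_plus_editors_by_month(conn, first, count):
    """Run the query once per month and concatenate the results."""
    rows = []
    for start, end in month_ranges(first, count):
        (n,) = conn.execute(PER_MONTH_SQL, (start, end)).fetchone()
        rows.append((start[:7], n))
    return rows

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edits (user_id INTEGER, ts TEXT)")
    data = [(1, "2013-09-%02d" % d) for d in range(1, 6)]    # 5 edits in Sep
    data += [(2, "2013-09-10"), (2, "2013-10-10")]           # never reaches 5
    data += [(3, "2013-10-%02d" % d) for d in range(1, 7)]   # 6 edits in Oct
    conn.executemany("INSERT INTO edits VALUES (?, ?)", data)
    return five_plus_editors_by_month(conn, (2013, 9), 3)
```

Each per-month run only scans one month of edits, so a single run should stay well under whatever the job timeout turns out to be, at the cost of more round trips.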

For 2, Jon made a Change: https://gerrit.wikimedia.org/r/#/c/96315/ and once it's merged, the scripts will run at their configured frequencies.
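
For anyone who wants the idea behind 2 spelled out, here's a minimal sketch of "only run a graph when its frequency says it's due".  This is not what Jon's Change actually contains; the config shape, the `FREQUENCIES` table, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a configured frequency name to an interval.
FREQUENCIES = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}

def is_due(frequency, last_run, now):
    """A graph is due if it has never run or its interval has elapsed."""
    if last_run is None:
        return True
    return now - last_run >= FREQUENCIES[frequency]

def graphs_to_run(graphs, last_runs, now):
    """Return the names of graphs whose configured frequency says to run."""
    return [name for name, freq in graphs.items()
            if is_due(freq, last_runs.get(name), now)]

def demo():
    now = datetime(2013, 11, 19, 12, 0)
    graphs = {"hourly-graph": "hourly", "daily-graph": "daily"}
    two_hours_ago = now - timedelta(hours=2)
    last_runs = {"hourly-graph": two_hours_ago, "daily-graph": two_hours_ago}
    return graphs_to_run(graphs, last_runs, now)
```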


For the bigger picture, I'll be in SF in mid-December.  We should totally get together and figure out how to do this in the general case.  For example, notice that in my query above I'm materializing all active editors for all months as a sub-query.  I think that would be a hugely useful materialized view (in Hive, MySQL, etc.).  Basically everyone would use it, and we could do the same thing for any standardized metric.
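
As a rough illustration of the materialized-view idea, here's a sketch that builds a summary table once and lets metrics read from it.  It uses Python with SQLite purely so it's self-contained; the `edits` and `active_editors` tables, their columns, and the function names are assumptions, and a real version in Hive or MySQL would look different.

```python
import sqlite3

def build_active_editors(conn, threshold=5):
    """(Re)build a summary table of users with >= threshold edits per month."""
    conn.execute("DROP TABLE IF EXISTS active_editors")
    conn.execute(
        "CREATE TABLE active_editors (month TEXT, user_id INTEGER, edits INTEGER)"
    )
    conn.execute(
        """
        INSERT INTO active_editors
        SELECT substr(ts, 1, 7), user_id, COUNT(*)
        FROM edits
        GROUP BY substr(ts, 1, 7), user_id
        HAVING COUNT(*) >= ?
        """,
        (threshold,),
    )

def editors_per_month(conn):
    """One example metric that reads the summary table instead of raw edits."""
    return conn.execute(
        "SELECT month, COUNT(*) FROM active_editors GROUP BY month ORDER BY month"
    ).fetchall()

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edits (user_id INTEGER, ts TEXT)")
    data = [(1, "2013-09-%02d" % d) for d in range(1, 6)]    # 5 edits in Sep
    data += [(2, "2013-09-10"), (2, "2013-10-10")]           # never reaches 5
    data += [(3, "2013-10-%02d" % d) for d in range(1, 7)]   # 6 edits in Oct
    conn.executemany("INSERT INTO edits VALUES (?, ?)", data)
    build_active_editors(conn)
    return editors_per_month(conn)
```

The expensive grouping happens once in `build_active_editors`; any dashboard query after that is a cheap read of the summary table.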


On Tue, Nov 19, 2013 at 9:16 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:



On Tue, Nov 19, 2013 at 2:07 AM, Matthew Flaschen <mflaschen@wikimedia.org> wrote:
On 11/13/2013 07:14 PM, Arthur Richards wrote:
So why are those backend scripts stupid? Because they run every hour and
recalculate _all_ the values for every single graph. For example, even
though total unique editors for June 2013 will never change, they are
still recalculated every hour.

Is it really true that they will never change?  I think many of the metrics are written such that when a page is deleted, it reduces edits in the past.  So if I delete a page today (November 2013) that happened to be edited in June 2013, that affects the June 2013 edit counts.

That isn't intuitive anyway, but if there's a change in this regard, it needs to be communicated.

Yeah, we refer to that as deletion drift.  Dario is heading an effort to make these metrics more standard and intuitive (https://meta.wikimedia.org/wiki/Research:Refining_the_definition_of_monthly_active_editors).  We'll have to see what we need for these dashboards and if the new definitions would help.
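
A toy example of deletion drift, in case it helps the discussion.  It assumes the metric only counts revisions of pages that still exist, which is how the effect is described above; the data and function names are made up.

```python
def edits_in_month(revisions, month):
    """Count revisions whose timestamp falls in the given YYYY-MM month."""
    return sum(1 for rev in revisions if rev["ts"].startswith(month))

def delete_page(revisions, page):
    """Simulate a deletion: the page's revisions drop out of the counted set."""
    return [rev for rev in revisions if rev["page"] != page]

def demo():
    revisions = [
        {"page": "A", "ts": "2013-06-03"},
        {"page": "A", "ts": "2013-06-20"},
        {"page": "B", "ts": "2013-06-11"},
        {"page": "B", "ts": "2013-11-02"},
    ]
    before = edits_in_month(revisions, "2013-06")
    # Page A is deleted in November 2013; its June edits disappear too.
    after = edits_in_month(delete_page(revisions, "A"), "2013-06")
    return before, after
```

Here the June 2013 count drops from 3 to 1 even though nothing happened in June, which is why caching "finished" months would change the numbers unless the definition changes too.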