K, so a quick follow-up on this. Jon and I worked today and identified two short term problems.
2. the scripts run hourly even for graphs that only need to be updated daily
For 1, I fiddled with the SQL until it performed a little better. It was also not correct, as I believe it was getting "the number of people who created an account in month X and made >= 5 edits anytime". I changed it to what I assumed we wanted, which is "the number of people who created an account and made >= 5 edits in month X". This new query (
https://gist.github.com/milimetric/7554108) takes 4 minutes to run 3 months' worth. Juliusz, any idea what the timeout is on that job? I'm running the query now for 13 months and if it's < timeout, we can just deploy it. Otherwise, we can maybe run one month at a time and concat results. Let me know what you think and I'll make a Change
For the bigger picture, I'll be in SF in mid-December. We should totally get together and figure out how to do this in the general case. For example, notice in my query above I'm materializing all active editors for all months as a sub-query. I think that would be a hugely useful materialized view (in Hive or MySQL or etc.). Basically everyone would use it, and we could do the same thing for any standardized metric.