Hi,
On Wed, Nov 13, 2013 at 05:14:51PM -0700, Arthur Richards wrote:
[...]
On Wed, Nov 13, 2013 at 3:54 PM, Juliusz Gonera jgonera@wikimedia.org wrote:
Because they run every hour and recalculate _all_ the values for every single graph. For example, even though total unique editors for June 2013 will never change, they are still recalculated every hour.
Several of our jobs had to overcome the same problem. The solution there was the same as you proposed: a container that stores aggregated, historic data, which is then reused when generating the graphs.
Adding yesterday's data to the container is one cron job. Generating the graphs from the data in the container is a separate cron job. This separation proved to be useful on many occasions.
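To illustrate that split, here is a rough sketch in Python. The file name, column layout, and the count_editors() stub are assumptions for illustration only, not our actual geowiki or Wikipedia Zero code:

    import csv
    import datetime
    import os

    CONTAINER = 'daily_unique_editors.tsv'  # plain-file "container" of aggregates


    def count_editors(day):
        """Stand-in for the real (expensive) aggregation over the raw data."""
        raise NotImplementedError('replace with the actual aggregation query')


    def aggregate_yesterday():
        """Cron job 1: append yesterday's aggregate; finished days are never recomputed."""
        yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
        done = set()
        if os.path.exists(CONTAINER):
            with open(CONTAINER) as f:
                done = {row[0] for row in csv.reader(f, delimiter='\t')}
        if yesterday in done:
            return  # idempotent: re-running the job does not duplicate rows
        with open(CONTAINER, 'a') as f:
            csv.writer(f, delimiter='\t').writerow([yesterday, count_editors(yesterday)])


    def generate_graphs():
        """Cron job 2: read only the container; no access to the raw data needed."""
        with open(CONTAINER) as f:
            rows = sorted(csv.reader(f, delimiter='\t'))
        # hand `rows` to whatever actually renders the graphs
        return rows

Each of the two functions would then be wired to its own crontab entry, which is exactly the separation described above.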
For some jobs the container itself is a separate database (e.g. geowiki), and for other jobs it is a set of plain files (e.g. Wikipedia Zero). Both approaches come with the obvious (dis-)advantages: querying a database is efficient and easy, but putting the data under version control and monitoring changes when having to rerun aggregation for, say, the last two weeks is easier when working with plain files.
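To make the "rerun aggregation for the last two weeks" point concrete, here is a sketch in the same hypothetical setup that recomputes only the affected rows; because the container is a plain file under version control, a simple git diff afterwards shows exactly which aggregates changed:

    import csv
    import datetime

    CONTAINER = 'daily_unique_editors.tsv'  # same hypothetical container as above


    def count_editors(day):
        """Same hypothetical stub as in the previous sketch."""
        raise NotImplementedError('replace with the actual aggregation query')


    def rerun_recent_days(days=14):
        """Recompute only the most recent `days` rows; older rows stay untouched."""
        cutoff = datetime.date.today() - datetime.timedelta(days=days)
        with open(CONTAINER) as f:
            # ISO dates sort lexicographically, so a plain string compare works
            rows = [r for r in csv.reader(f, delimiter='\t')
                    if r[0] < cutoff.isoformat()]
        day = cutoff
        while day < datetime.date.today():
            rows.append([day.isoformat(), count_editors(day)])
            day += datetime.timedelta(days=1)
        with open(CONTAINER, 'w') as f:
            csv.writer(f, delimiter='\t').writerows(sorted(rows))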
We could start with a spike investigating if there is a framework for aggregating the sums [...]
Our approaches are hard-wired into our legacy code, so we do not use a common, solid framework for this.
I haven't done any research on whether or not such frameworks exist. But if you find a good one, please let us know; it would certainly be interesting.
Best regards,
Christian