Hm, I don't think we will have much trouble with the size of the input.
Well, my post was also about how to store hourly data in a concise manner (really a sparse array), so we could serve hourly precision without too much overhead.
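To make that concrete, here is a rough sketch of what I mean by sparse storage; this is purely illustrative and not the actual file layout. The idea is to keep per-page, per-day counts only for the hours that actually had views:

    # Illustrative only; not the real pagecounts file format.
    # Keep hourly views per (page, day), storing only non-zero hours.
    from collections import defaultdict

    hourly = defaultdict(dict)  # (page, day) -> {hour: views}

    def record(page, day, hour, views):
        if views:                      # the "sparse" part: skip zero hours
            hourly[(page, day)][hour] = views

    def monthly_total(page, days_in_month):
        # hourly precision stays available, but a monthly roll-up is cheap
        return sum(v
                   for day in days_in_month
                   for v in hourly.get((page, day), {}).values())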
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Wednesday, October 02, 2013 6:19 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]
Here is a shameless plug for existing data files in a highly condensed format, which could serve as input for whatever database we choose.
Hm, I don't think we will have much trouble with the size of the input. We're currently thinking of processing the hourly data through Hadoop, and that shouldn't even blink at a few TB of data per day. What we'd like to come to consensus on is the most useful output format. So far, I'm hearing that monthly aggregates by page in a MySQL database are the bare minimum we should release on day 1. We can then iterate and add any useful dimensions to this data (like category information) or increase the resolution in parallel tables. If the data becomes too large for MySQL, we can look at other databases.
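As a rough illustration of that day-1 minimum (the field names and schema here are invented, not anything we've agreed on), the core of the job is just a reduction from hourly rows to monthly per-page totals:

    # Hypothetical sketch of the day-1 aggregation; field names are made up.
    # Input rows: (project, page_title, "YYYY-MM-DDTHH" timestamp, views).
    from collections import Counter

    def monthly_aggregates(rows):
        totals = Counter()
        for project, page, ts, views in rows:
            month = ts[:7]                      # "YYYY-MM"
            totals[(project, page, month)] += int(views)
        return totals                           # (project, page, month) -> views

In practice this reduction would run as a Hadoop job over the hourly dumps, with the result bulk-loaded into a MySQL table keyed on (project, page, month); extra dimensions such as category, or finer resolution, could be added later as parallel tables.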
One nice feature of these files: they detect missing input files and correct the monthly counts to compensate.
This is something we definitely want to port to Hadoop / use in some way.
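For illustration, one plausible way such a correction could work (the actual logic in Erik's scripts may differ) is to scale the observed total by the fraction of hourly files that were actually present:

    # Hypothetical correction; the real scripts may use a different method.
    def corrected_monthly_count(observed_views, hours_present, hours_in_month):
        # Scale up the observed total to compensate for missing hourly files.
        if hours_present == 0:
            return 0
        return round(observed_views * hours_in_month / hours_present)

    # Example: a 30-day month has 720 hourly files; if only 700 were found,
    # corrected_monthly_count(1400000, 700, 720) == 1440000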