Here is a shameless plug for the existing data files, which are in a highly condensed format and could serve as input for whatever database we choose.


Hm, I don't think we will have much trouble with the size of the input.  We're currently thinking of processing the hourly data through Hadoop, and that shouldn't even blink at a few TB of data per day.  What we'd like to come to consensus on is the most useful output format.  So far, I'm hearing that monthly aggregates by page in a MySQL database are the bare minimum we should release on day 1.  We can then iterate and add any useful dimensions to this data (like category information) or increase the resolution in parallel tables.  If the data becomes too large for MySQL, we can look at other databases.
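To make the "monthly aggregates by page" idea concrete, here is a minimal sketch of the reduce step such a Hadoop job might perform. This is not the actual job, just an illustration under assumed record shapes: hourly records as (page, "YYYY-MM-DDTHH", count) tuples, keyed down to (page, "YYYY-MM") totals.

```python
from collections import defaultdict

def aggregate_monthly(hourly_records):
    """Sum hourly per-page view counts into monthly totals.

    hourly_records: iterable of (page, "YYYY-MM-DDTHH", count) tuples
    (a hypothetical input format, for illustration only).
    Returns a dict mapping (page, "YYYY-MM") -> total views.
    """
    totals = defaultdict(int)
    for page, hour, count in hourly_records:
        month = hour[:7]  # truncate the timestamp to "YYYY-MM"
        totals[(page, month)] += count
    return dict(totals)

records = [
    ("Main_Page", "2011-01-01T00", 120),
    ("Main_Page", "2011-01-01T01", 95),
    ("Main_Page", "2011-02-01T00", 80),
]
print(aggregate_monthly(records))
# {('Main_Page', '2011-01'): 215, ('Main_Page', '2011-02'): 80}
```

The resulting (page, month, count) rows map directly onto a simple MySQL table, and extra dimensions like category could later be added as additional key columns or parallel tables.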
 

One goody of these files: the scripts that generate them detect missing hourly input files and correct the monthly counts to compensate.
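The source doesn't say exactly how that correction works; a plausible minimal sketch is to scale the raw monthly total by the ratio of expected to actually-present hourly files, assuming views are roughly uniform across hours. Everything below (the function and its parameters) is a hypothetical illustration, not the real scripts.

```python
import calendar

def correct_for_missing_hours(raw_total, hours_present, year, month):
    """Scale a monthly count up when some hourly input files were missing.

    Assumes views are roughly uniform across the hours of the month,
    so the corrected count is raw_total * (expected_hours / hours_present).
    """
    days_in_month = calendar.monthrange(year, month)[1]
    expected_hours = days_in_month * 24
    if hours_present == 0:
        return 0  # no data at all; nothing to extrapolate from
    return round(raw_total * expected_hours / hours_present)

# January 2011 has 31 * 24 = 744 hours; with 2 hourly files missing
# (742 present), a raw total of 742000 is scaled up accordingly.
print(correct_for_missing_hours(742000, 742, 2011, 1))
# 744000
```

A uniform-hours assumption is crude (traffic has strong daily cycles), so a real correction might instead use per-hour-of-day averages, but the bookkeeping idea is the same.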


This is something we definitely want to port to Hadoop / use in some way.