Here is a shameless plug for the existing data files in a highly condensed format, which could serve as input for whatever database we choose.
Hm, I don't think we will have much trouble with the size of the input.
We're currently thinking of processing the hourly data through Hadoop, and
that shouldn't even blink at a few TB of data per day. What we'd like to
come to consensus on is the most useful output format. So far, I'm hearing
that monthly aggregates by page in a MySQL database is the bare minimum we
should release on day 1. We can then iterate and add any useful dimensions
to this data (like category information) or increase the resolution in
parallel tables. If the data becomes too large for MySQL, we can look at
other databases.
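To make the day-1 output concrete, here is a minimal sketch of the monthly roll-up step. All names here are hypothetical (the original thread does not specify a schema): it assumes hourly rows of (timestamp, page, count) and produces per-page monthly totals keyed on (month, page), which would map directly onto a MySQL table with that composite key.

```python
from collections import defaultdict

def monthly_aggregate(hourly_rows):
    """Roll hourly (timestamp, page, count) rows up into monthly totals.

    hourly_rows: iterable of ('YYYY-MM-DD-HH', page, count) tuples.
    Returns {(month, page): total}, i.e. one row per page per month,
    ready to bulk-load into a table keyed on (month, page).
    """
    totals = defaultdict(int)
    for ts, page, count in hourly_rows:
        month = ts[:7]  # 'YYYY-MM' prefix of the hourly timestamp
        totals[(month, page)] += count
    return dict(totals)
```

In a Hadoop job this would be the reduce side, with (month, page) as the key; the sketch above just shows the aggregation logic in one place.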
One goody of these files: they detect missing input files and correct the monthly counts to make up for the gaps.
This is something we definitely want to port to Hadoop or otherwise reuse.
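The original post does not say how the correction works, but one plausible reading is a simple extrapolation: scale the observed total by the fraction of hourly files that actually arrived. A sketch under that assumption (the function name and signature are mine):

```python
def corrected_monthly_count(observed_total, hours_present, hours_in_month):
    """Extrapolate a monthly count when some hourly input files are missing.

    Assumption (not stated in the original thread): the correction scales
    the observed total by expected/present hours, i.e. it treats missing
    hours as if they had the average traffic of the observed hours.
    """
    if hours_present == 0:
        return 0  # no data at all; nothing sensible to extrapolate
    return round(observed_total * hours_in_month / hours_present)
```

For example, with 710 of 720 hourly files present, a raw total of 71,000 would be corrected upward to 72,000. Whether the real scripts do exactly this or something more refined (e.g. per-weekday averages) would need checking against their source.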