Hi all,
We in the Analytics team are trying to choose a storage technology to keep the pageview data for analysis.
We are not aiming yet for a final system that covers all our needs (there are still things to discuss); as a first step we want something
that implements the current stats.grok.se functionality. That way we can get a better grasp of the difficulties and limitations we will face regarding performance and privacy.
The objective of this thread is to choose 3 storage technologies. We will later set up each of them, fill it with 1 day of test data, evaluate them all and decide which one to go for.
There are 2 blocks of data to be stored:
- Cube that represents the number of pageviews broken down by the following dimensions:
    - day/hour (size: 24)
    - project (size: 800)
    - agent type (size: 2)
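To give a feel for the cube's size, here is a rough sketch that multiplies the dimension sizes listed above. The names DIMENSION_SIZES and cube_cells are made up for illustration, and it assumes a dense cube (every combination of values present), which is an upper bound rather than a measurement.

```python
# Dimension sizes as listed above, for one day of data.
# The dictionary keys are illustrative names, not an agreed schema.
DIMENSION_SIZES = {"hour": 24, "project": 800, "agent_type": 2}


def cube_cells(sizes):
    """Number of cells in a dense cube: the product of its dimension sizes."""
    total = 1
    for size in sizes.values():
        total *= size
    return total


cells_per_day = cube_cells(DIMENSION_SIZES)
print(cells_per_day)  # 38400
```

So one day of the cube is at most 24 * 800 * 2 = 38,400 cells before any roll-ups are added.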
Note: I think we should assume that the storage system will receive high-volume batch inserts every hour or so, and queries that are much more frequent but also much lighter in data size.
And that is that.
So please, feel free to suggest storage technologies, comment, etc!
And if I made any assumption that you do not agree with, please comment on that too!
I will start the thread with 2 suggestions:
1) PostgreSQL: Seems to be able to handle the volume of the data, and can be used to implement diminishing resolution for time series.
2) Project Voldemort: As we are denormalizing the cube entirely for anonymity, the db doesn't need to compute aggregations, so a key-value store may well be enough.
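To illustrate the key-value idea in suggestion 2, here is a minimal sketch of a fully denormalized cube: every roll-up is written at insert time, so a read is a single key lookup with no aggregation. It uses a plain Python dict as a stand-in for the store and does not use Project Voldemort's client API. The ALL marker, the denormalize function, the key layout and the sample rows are all invented for illustration.

```python
from itertools import product

# Hypothetical marker meaning "aggregated over this dimension".
ALL = "all"


def denormalize(rows):
    """Expand (hour, project, agent_type, views) rows into a flat map.

    Each row contributes to 2**3 = 8 keys: its own cell plus every
    combination in which one or more dimensions is replaced by ALL.
    """
    store = {}
    for hour, project, agent, views in rows:
        for key in product((hour, ALL), (project, ALL), (agent, ALL)):
            store[key] = store.get(key, 0) + views
    return store


# Made-up sample rows, standing in for one hourly batch insert.
rows = [
    (0, "en.wikipedia", "user", 10),
    (0, "en.wikipedia", "spider", 3),
    (1, "de.wikipedia", "user", 5),
]
store = denormalize(rows)

# Queries are plain lookups; nothing is summed at read time.
print(store[(0, "en.wikipedia", ALL)])  # 13
print(store[(ALL, ALL, "user")])        # 15
print(store[(ALL, ALL, ALL)])           # 18
```

With the dimension sizes given earlier, a fully populated store of this shape would hold at most (24 + 1) * (800 + 1) * (2 + 1) = 60,075 keys per day. The trade-off is heavier writes in the hourly batch in exchange for very cheap reads.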
Cheers!
Marcel