Hi,
today there were some guesses about how much data we are actually talking about for the “queryable public interface for pageview data”. As no one followed up on that, I did some quick back-of-the-envelope computations.
The pagecounts files now come with ~7-9M data points/hour.
As we talked about computing "per day" data, some hourly readings snuggle up and collapse into a single per-day row, and for the two days that I computed, I arrived at ~70M data points/day.
So each day we can expect ~70M new data points = ~70M new rows.
Now to the per row cost:
On Mon, Sep 30, 2013 at 01:57:58PM -0700, Diederik van Liere wrote:
> - Make simple datawarehouse schema for mysql db (based on the current
>   webstatscollector datafiles)
>
> Page Table
>   page_id
>   page title
Let's ignore that table for now.
> Fact Table
>   fact_id
Let's ignore "fact_id" as it's more than the bare pagecount files provide.
>   page_id (FK -> Page Table)
With ~70M different rows per day, MEDIUMINT (unsigned max ~16.7M) is too small. With an INT foreign key, we may also run out of numbers at some point, but let's ignore that for now and settle for INT = 4 bytes.
>   pageview date
DATE is 3 bytes
>   pageview count
Zipping through some random pagecount files, I saw values up to ~17M/hour. That is too large for MEDIUMINT, so we have to take at least INT = 4 bytes. That also suffices for the per-day sums.
>   bytes served
Zipping through some random pagecount files, I saw values up to ~54G/hour. That's too big for INT, so let's settle for BIGINT = 8 bytes. That also suffices for the per-day sums.
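Putting those choices together, a minimal sketch of the fact table could look like the following (the table and column names are just mine for illustration, not part of Diederik's proposal):

  CREATE TABLE fact (
    page_id        INT UNSIGNED    NOT NULL,  -- FK -> page table, 4 bytes
    pageview_date  DATE            NOT NULL,  -- 3 bytes
    pageview_count INT UNSIGNED    NOT NULL,  -- 4 bytes, fits ~17M/hour and the per-day sums
    bytes_served   BIGINT UNSIGNED NOT NULL   -- 8 bytes, fits ~54G/hour and the per-day sums
  );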
Summing up, we have 4+3+4+8 = 19 bytes per data point.
19 bytes * 70M * 365 days = 485 GB in 1 year.
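If someone wants to double-check the arithmetic, here is the same figure as a quick MySQL one-liner (decimal GB):

  SELECT 19 * 70000000 * 365 / 1e9 AS gb_per_year;  -- ~485 GB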
And these 19 bytes are really minimal: the INT page_id will overrun at some point, there is no row id (we dropped fact_id), 70M rows/day is a very, very defensive estimate, and the total size is without the page table, without indices, and without all the other stuff that comes with it.
Have fun, Christian
P.S.: There was some discussion whether or not we should drop "bytes served", as "no one uses it". I am glad we did not second-guess the requirements and kept the field in, as that now gives us a nice argument for free if we decide to drop or discuss it: "bytes served" is worth ~200 GB in 1 year when modelled directly.
While dropping it would help, it would not solve the problem, as ~285 GB of data per year would remain.
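And the same kind of check for the split between "bytes served" and the rest, under the same assumptions (which is where the ~200 GB and ~285 GB come from, modulo rounding):

  SELECT 8  * 70000000 * 365 / 1e9 AS bytes_served_gb_per_year,  -- ~204 GB
         11 * 70000000 * 365 / 1e9 AS remaining_gb_per_year;     -- ~281 GB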
P.P.S.: Hope is not lost :-) There are of course thousands of ways to get the size down.