On Mon, Jun 8, 2015 at 7:44 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
(+ Eric)
On Mon, Jun 8, 2015 at 5:42 PM, Toby Negrin tnegrin@wikimedia.org wrote:
As always, I'd recommend that we go with tech we are familiar with -- mysql or cassandra. We have a cassandra committer on staff who would be able to answer these questions in detail.
I guess that'd be me; happy to help if I can!
On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
[ ... ]
There are 2 blocks of data to be stored:
Cube that represents the number of pageviews broken down by the following dimensions:
day/hour (size: 24)
project (size: 800)
agent type (size: 2)
To test with an initial level of anonymity, all cube cells whose value is less than k=100 have an undefined value. However, to be able to retrieve aggregated values without losing those undefined counts, all combinations of slices and dices are precomputed before anonymization and belong to the cube, too. Like this:
dim1, dim2, dim3, ..., dimN, val
a,    null, null, ..., null, 15     // pv for dim1=a
a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
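To make the precomputation concrete, here is a rough Python sketch of what I mean (the function name, dimension names and counts are made up for illustration, not an actual implementation):

from itertools import combinations
from collections import defaultdict

K = 100  # anonymity threshold: cells whose value is below K become undefined (None)

def precompute_cube(raw_rows, dimensions):
    """raw_rows: list of (dims_dict, pageview_count); dimensions: ordered dim names."""
    cube = defaultdict(int)
    # Precompute every slice/dice: for each subset of dimensions, the ones
    # left out are rolled up and stored as null (None here).
    for size in range(len(dimensions) + 1):
        for subset in combinations(dimensions, size):
            for dims, count in raw_rows:
                key = tuple(dims[d] if d in subset else None for d in dimensions)
                cube[key] += count
    # Anonymize only after aggregating, so rolled-up totals still include the
    # small cells that get blanked out below.
    return {key: (val if val >= K else None) for key, val in cube.items()}

# Toy input: (dimension values, pageview count) -- values are made up.
raw = [
    ({'project': 'en.wikipedia', 'agent': 'user',   'hour': 17}, 150),
    ({'project': 'en.wikipedia', 'agent': 'spider', 'hour': 17}, 7),
]
print(precompute_cube(raw, ['project', 'agent', 'hour']))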
So the size of this dataset would be something between 100M and 200M records per year, I think.
Could you expound on this a bit? Is it just the 3 dimensions above (day, project, agent type), or something more? Also, how will this be queried? Do we need to query by dimensions arbitrarily, or will the "higher" dimensions always be qualified with matches on the lower ones, as in the example above (dim1=a, then dim1=a & dim2=x, then dim1=a & dim2=x & dimN=true)?
Timeseries dataset that stores the number of pageviews per article in time with:
maximum resolution: hourly
diminishing resolution over time is accepted if there are performance problems
article (dialect.project/article), day/hour, value
en.wikipedia/Main_page, 2015-01-01 17, 123456
en.wiktionary/Bazinga,  2015-01-02 13, 23456
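For illustration, a small Python sketch of that record shape, plus one way the resolution could be diminished for old data (the names are made up, this is not a schema proposal):

from collections import defaultdict

# One record per (article, hour); keys and values copied from the example above.
hourly = {
    ("en.wikipedia/Main_page", "2015-01-01 17"): 123456,
    ("en.wiktionary/Bazinga",  "2015-01-02 13"): 23456,
}

def downsample_to_daily(hourly_rows):
    """Roll hourly counts up to daily ones, in case keeping hourly resolution
    for old data turns out to be too expensive."""
    daily = defaultdict(int)
    for (article, hour), count in hourly_rows.items():
        day = hour.split(" ")[0]  # "2015-01-01 17" -> "2015-01-01"
        daily[(article, day)] += count
    return dict(daily)

print(downsample_to_daily(hourly))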
It's difficult to calculate the size of that. How many articles do we have? 34M? But not all of them will have pageviews every hour...
Note: I guess we should consider that the storage system will have high-volume batch inserts every hour or so, and queries that will be a lot more frequent but also a lot lighter in data size.
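Something like this toy Python sketch of the workload, with a plain dict standing in for whatever storage system we end up choosing (names and batch size are just illustrative):

BATCH_SIZE = 10000  # arbitrary chunk size, just for illustration

def hourly_load(store, new_rows):
    """One bulky load per hour: new_rows is a list of ((article, hour), count)."""
    for i in range(0, len(new_rows), BATCH_SIZE):
        store.update(new_rows[i:i + BATCH_SIZE])  # one bulk write per chunk

def query(store, article, hour):
    """Typical read: a single (article, hour) cell, much lighter than a load."""
    return store.get((article, hour))

store = {}  # in-memory stand-in for the real storage system
hourly_load(store, [(("en.wikipedia/Main_page", "2015-01-01 17"), 123456)])
print(query(store, "en.wikipedia/Main_page", "2015-01-01 17"))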