As always, I'd recommend that we go with tech we
are familiar with --
MySQL or Cassandra. We have a Cassandra committer on staff who would be
able to answer these questions in detail.
-Toby
On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
wrote:
*This discussion is intended to be a branch of
the thread: "[Analytics]
Pageview API Status update".*
Hi all,
We in Analytics are trying to *choose a storage technology to keep the
pageview data* for analysis.
We don't want to get to a final system that covers all our needs yet
(there are still things to discuss), but rather to have something *that
implements the current stats.grok.se <http://stats.grok.se>
functionalities* as a first step. This way we can get a better grasp of
what our difficulties and limitations will be regarding performance and
privacy.
The objective of this thread is to *choose 3 storage technologies*. We
will later set up and fill each of them with 1 day of test data, evaluate
them, and decide which one of them we will go for.
There are 2 blocks of data to be stored:
1. *Cube that represents the number of pageviews broken down by the
following dimensions*:
- day/hour (size: 24)
- project (size: 800)
- agent type (size: 2)
To test with an initial level of anonymity, all cube cells whose value is
less than k=100 have an undefined value. However, to be able to retrieve
aggregated values without losing those undefined counts, all combinations
of slices and dices are precomputed before anonymization and belong to the
cube, too. Like this:
dim1, dim2, dim3, ..., dimN, val
a,    null, null, ..., null,  15     // pv for dim1=a
a,    x,    null, ..., null,  34     // pv for dim1=a & dim2=x
a,    x,    1,    ..., null,  27     // pv for dim1=a & dim2=x & dim3=1
a,    x,    1,    ..., true,  undef  // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M
records per year, I think.
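To make the anonymization scheme above concrete, here is a minimal Python sketch of it. The dimension names, values, and the toy fact table are all illustrative assumptions, not the real schema; only the k=100 threshold and the "precompute every rollup before suppressing" order come from the proposal.

```python
from itertools import product

# Hypothetical toy fact table: (project, agent_type) -> pageview count.
facts = {
    ("en.wikipedia", "user"): 250,
    ("en.wikipedia", "spider"): 40,
    ("en.wiktionary", "user"): 90,
}

K = 100  # anonymity threshold from the proposal


def rollups(facts):
    """Precompute every slice/dice combination (None = dimension rolled up)."""
    cube = {}
    for dims, value in facts.items():
        # each dimension either keeps its value or is rolled up to None
        for mask in product([True, False], repeat=len(dims)):
            key = tuple(d if keep else None for d, keep in zip(dims, mask))
            cube[key] = cube.get(key, 0) + value
    return cube


def anonymize(cube, k=K):
    """Suppress cells below k AFTER aggregation, so totals stay consistent."""
    return {key: (v if v >= k else None) for key, v in cube.items()}


cube = anonymize(rollups(facts))
# The grand total (all dims rolled up) still includes the suppressed cells:
print(cube[(None, None)])             # 380
print(cube[("en.wiktionary", None)])  # None (suppressed: 90 < 100)
```

The point of precomputing before suppressing is visible in the output: the en.wiktionary cell is undefined, yet the grand total of 380 still counts its 90 views.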
2. *Timeseries dataset that stores the number of pageviews per
article in time with*:
- maximum resolution: hourly
- diminishing resolution over time is accepted if there are
performance problems
article (dialect.project/article), day/hour, value
en.wikipedia/Main_page, 2015-01-01 17, 123456
en.wiktionary/Bazinga, 2015-01-02 13, 23456
It's difficult to calculate the size of that. How many articles do we
have? 34M?
But not all of them will have pageviews every hour...
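The "diminishing resolution over time" idea can be sketched in a few lines: collapse hourly points into daily totals once a series gets old. The timestamps follow the "YYYY-MM-DD HH" format of the example rows; the sample values are made up.

```python
from collections import defaultdict

# Hypothetical hourly series for one article, keyed "YYYY-MM-DD HH".
hourly = {
    "2015-01-01 17": 123456,
    "2015-01-01 18": 2000,
    "2015-01-02 13": 23456,
}


def downsample_to_daily(hourly):
    """Diminishing resolution: fold hourly points into daily totals."""
    daily = defaultdict(int)
    for ts, pv in hourly.items():
        day = ts.split(" ")[0]  # drop the hour component
        daily[day] += pv
    return dict(daily)


print(downsample_to_daily(hourly))
# {'2015-01-01': 125456, '2015-01-02': 23456}
```

Applied to rows older than some cutoff, this trades a 24x reduction in row count for the loss of hourly detail, which is exactly the accepted degradation.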
*Note*: I guess we should consider that the storage system will
presumably have high volume batch inserts every hour or so, and queries
that will be a lot more frequent but also a lot lighter in data size.
And that is that.
*So please, feel free to suggest storage technologies, comment, etc!*
And if there is any assumption I made in which you do not agree, please
comment also!
I will start the thread with 2 suggestions:
1) *PostgreSQL*: Seems to be able to handle the volume of the data and
knows how to implement diminishing resolution for timeseries.
2) *Project Voldemort*: As we are denormalizing the cube entirely for
anonymity, the db doesn't need to compute aggregations, so it may well be a
key-value store.
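To illustrate why a key-value store suffices here: since every slice/dice is precomputed, a read is a single key lookup and the store never aggregates. The key scheme below (with "ALL" standing in for a rolled-up dimension) and the dict stand-in for the store client are assumptions for the sketch.

```python
def cube_key(project=None, agent=None, hour=None):
    """Build a flat string key from the (possibly rolled-up) dimensions."""
    return "|".join([project or "ALL", agent or "ALL", hour or "ALL"])


store = {}  # stand-in for a key-value store client such as Voldemort's

# the hourly batch job writes precomputed, anonymized cells:
store[cube_key("en.wikipedia", "user", "2015-01-01 17")] = 123456
store[cube_key("en.wikipedia")] = 290000  # rollup over agent and hour

# a query is then a plain get, with no server-side aggregation:
print(store[cube_key("en.wikipedia")])  # 290000
```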
Cheers!
Marcel
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics