WMF uses Postgres for some things, no? Or is that just in labs?
> On Jun 8, 2015, at 17:42, Toby Negrin <tnegrin@wikimedia.org> wrote:
>
> As always, I'd recommend that we go with tech we are familiar with -- mysql
> or cassandra. We have a cassandra committer on staff who would be able to
> answer these questions in detail.
>
> -Toby
>
> On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
> This discussion is intended to be a branch of the thread: "[Analytics]
> Pageview API Status update".
>
> Hi all,
>
> We in the Analytics team are trying to choose a storage technology to hold
> the pageview data for analysis.
>
> We don't want to get to a final system that covers all our needs yet (there
> are still things to discuss), but to have something that implements the
> current stats.grok.se functionality as a first step. This way we can get a
> better grasp of what our difficulties and limitations will be regarding
> performance and privacy.
>
> The objective of this thread is to choose 3 storage technologies. We will
> later set up and fill each of them with 1 day of test data, evaluate them,
> and decide which one we will go for.
>
> There are 2 blocks of data to be stored:
> Cube that represents the number of pageviews broken down by the following
> dimensions:
> day/hour (size: 24)
> project (size: 800)
> agent type (size: 2)
> To test with an initial level of anonymity, all cube cells whose value is
> less than k=100 have an undefined value. However, to be able to retrieve
> aggregated values without losing those undefined counts, all combinations
> of slices and dices are precomputed before anonymization and belong to the
> cube, too. Like this:
>
> dim1, dim2, dim3, ..., dimN, val
> a, null, null, ..., null, 15     // pv for dim1=a
> a, x, null, ..., null, 34        // pv for dim1=a & dim2=x
> a, x, 1, ..., null, 27           // pv for dim1=a & dim2=x & dim3=1
> a, x, 1, ..., true, undef        // pv for dim1=a & dim2=x & ... & dimN=true
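>
> To make the precomputed-slices idea concrete, here is a minimal Python
> sketch (a toy illustration, not the pipeline we would run; k=100 and the
> three example dimensions above are assumed):
>
> from itertools import product
> from collections import defaultdict
>
> K = 100  # anonymity threshold
>
> def build_cube(rows):
>     """Aggregate every slice/dice combination; None is the rolled-up
>     'all values' wildcard for a dimension."""
>     cube = defaultdict(int)
>     for hour, project, agent, count in rows:
>         for cell in product((hour, None), (project, None), (agent, None)):
>             cube[cell] += count
>     # Apply the k threshold *after* aggregating, so roll-ups stay exact.
>     return {c: (v if v >= K else None) for c, v in cube.items()}
>
> rows = [(17, "en.wikipedia", "user", 150), (17, "en.wikipedia", "spider", 30)]
> cube = build_cube(rows)
> print(cube[(17, "en.wikipedia", None)])      # 180: aggregate survives
> print(cube[(17, "en.wikipedia", "spider")])  # None: 30 < k, suppressed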
>
> So the size of this dataset would be something between 100M and 200M
> records per year, I think.
>
> Timeseries dataset that stores the number of pageviews per article over time, with:
> maximum resolution: hourly
> diminishing resolution over time is accepted if there are performance problems
> article (dialect.project/article), day/hour, value
> en.wikipedia/Main_page, 2015-01-01 17, 123456
> en.wiktionary/Bazinga, 2015-01-02 13, 23456
>
> It's difficult to calculate the size of that. How many articles do we have? 34M?
> But not all of them will have pageviews every hour...
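>
> For concreteness, a sketch of what this could look like as a PostgreSQL
> table, using psycopg2 (the table and column names are only illustrative,
> nothing is decided):
>
> import psycopg2
>
> conn = psycopg2.connect("dbname=pageviews")  # placeholder DSN
> with conn, conn.cursor() as cur:
>     cur.execute("""
>         CREATE TABLE IF NOT EXISTS pageviews_hourly (
>             article text        NOT NULL,  -- dialect.project/article
>             ts      timestamptz NOT NULL,  -- hourly resolution
>             views   bigint      NOT NULL,
>             PRIMARY KEY (article, ts)
>         )
>     """)
>     cur.execute("INSERT INTO pageviews_hourly VALUES (%s, %s, %s)",
>                 ("en.wikipedia/Main_page", "2015-01-01 17:00+00", 123456))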
>
>
> Note: I guess we should consider that the storage system will presumably
> have high-volume batch inserts every hour or so, and queries that will be a
> lot more frequent but also a lot lighter in data size.
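>
> If we end up on PostgreSQL, the hourly batch load would probably want COPY
> rather than row-by-row INSERTs; a rough sketch, reusing the illustrative
> table from above:
>
> import io
> import psycopg2
>
> def load_hour(conn, rows):
>     """rows: iterable of (article, ts, views) tuples for one hour."""
>     buf = io.StringIO()
>     for article, ts, views in rows:
>         buf.write(f"{article}\t{ts}\t{views}\n")
>     buf.seek(0)
>     with conn, conn.cursor() as cur:  # one transaction per hourly batch
>         cur.copy_from(buf, "pageviews_hourly",
>                       columns=("article", "ts", "views"))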
>
> And that is that.
> So please, feel free to suggest storage technologies, comment, etc!
> And if you disagree with any assumption I made, please comment as well!
>
> I will start the thread with 2 suggestions:
> 1) PostgreSQL: Seems able to handle the data volume, and we know how to
> implement diminishing resolution for timeseries on it (rough sketch below).
> 2) Project Voldemort: As we are denormalizing the cube entirely for
> anonymity, the db doesn't need to compute aggregations, so it may well be a
> key-value store (see the key layout sketch below).
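>
> On 1), a rough sketch of diminishing resolution in PostgreSQL, again in
> Python/psycopg2 (it assumes a pageviews_daily roll-up table; the names and
> the 90-day cutoff are placeholders, not decisions):
>
> def diminish_resolution(conn, older_than_days=90):
>     """Roll hourly rows past the cutoff into daily rows, then drop
>     the hourly detail."""
>     with conn, conn.cursor() as cur:  # single transaction
>         cur.execute("""
>             INSERT INTO pageviews_daily (article, day, views)
>             SELECT article, date_trunc('day', ts)::date, sum(views)
>             FROM pageviews_hourly
>             WHERE ts < now() - %s * interval '1 day'
>             GROUP BY 1, 2
>         """, (older_than_days,))
>         cur.execute("DELETE FROM pageviews_hourly "
>                     "WHERE ts < now() - %s * interval '1 day'",
>                     (older_than_days,))
>
> And on 2), a store-agnostic sketch of how the denormalized cube cells could
> map onto keys and values ("*" stands for a rolled-up dimension; the layout
> is just one option):
>
> def cell_key(hour=None, project=None, agent=None):
>     return "|".join("*" if v is None else str(v)
>                     for v in (hour, project, agent))
>
> store = {}  # stand-in for the real key-value client
> store[cell_key(17, "en.wikipedia")] = 180             # "17|en.wikipedia|*"
> store[cell_key(17, "en.wikipedia", "spider")] = None  # suppressed, < k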
>
> Cheers!
>
> Marcel
>
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics