*This discussion is intended to be a branch of the thread: "[Analytics] Pageview API Status update".*
Hi all,
We in Analytics are trying to *choose a storage technology to keep the pageview data* for analysis.
We don't want to build the final system that covers all our needs yet (there are still things to discuss), but rather something *that implements the current stats.grok.se functionalities* as a first step. This way we can get a better grasp of what our difficulties and limitations will be regarding performance and privacy.
The objective of this thread is to *choose 3 storage technologies*. We will later set up and fill each of them with 1 day of test data, evaluate them, and decide which one we will go for.
There are 2 blocks of data to be stored:
1. *Cube that represents the number of pageviews broken down by the following dimensions*:
   - day/hour (size: 24)
   - project (size: 800)
   - agent type (size: 2)
To test with an initial level of anonymity, all cube cells whose value is less than k=100 have an undefined value. However, to be able to retrieve aggregated values without losing those undefined counts, all combinations of slices and dices are precomputed before anonymization and belong to the cube, too. Like this:
dim1, dim2, dim3, ..., dimN, val
a,    null, null, ..., null, 15     // pv for dim1=a
a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M records per year, I think.
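To make this concrete, here is a minimal Python sketch of the precompute-then-suppress step (toy data and hypothetical names, not a settled implementation):

from itertools import combinations
from collections import defaultdict

K = 100  # anonymity threshold from above

def precompute_cube(records, n_dims):
    """Aggregate views over every subset of dimensions (None = rolled up),
    then suppress any cell whose count is below K."""
    cube = defaultdict(int)
    for dims, views in records:
        for r in range(n_dims + 1):
            for kept in combinations(range(n_dims), r):
                key = tuple(d if i in kept else None for i, d in enumerate(dims))
                cube[key] += views
    # suppression happens after aggregation, so rollups keep the hidden counts
    return {key: (v if v >= K else None) for key, v in cube.items()}

# toy rows: (hour, project, agent_type) -> views
records = [(("17", "en.wikipedia", "user"), 120),
           (("17", "en.wiktionary", "user"), 40)]
cube = precompute_cube(records, 3)
print(cube[("17", None, None)])             # 160 -- aggregate survives
print(cube[("17", "en.wiktionary", None)])  # None -- suppressed, < K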
2. *Timeseries dataset that stores the number of pageviews per article in time with*:
   - maximum resolution: hourly
   - diminishing resolution over time is accepted if there are performance problems
article (dialect.project/article), day/hour, value
en.wikipedia/Main_page, 2015-01-01 17, 123456
en.wiktionary/Bazinga, 2015-01-02 13, 23456
It's difficult to calculate the size of that. How many articles do we have? 34M? But not all of them will have pageviews every hour...
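As a side note, the "diminishing resolution" idea could look roughly like this RRD-style sketch (the 90-day window and summing to daily are assumptions, not a decided policy):

from collections import defaultdict
from datetime import datetime, timedelta

HOURLY_WINDOW = timedelta(days=90)  # assumed cutoff for hourly detail

def downsample(points, now):
    """points: (hour_timestamp, views) pairs for one article.
    Keep hourly points inside the window; collapse older ones to daily sums."""
    daily = defaultdict(int)
    recent = []
    for ts, views in points:
        if now - ts <= HOURLY_WINDOW:
            recent.append((ts, views))
        else:
            daily[ts.date()] += views
    return sorted(daily.items()) + sorted(recent)

now = datetime(2015, 6, 13)
points = [(datetime(2015, 1, 1, 17), 123456), (datetime(2015, 6, 12, 13), 23456)]
print(downsample(points, now))  # old hour folded into its day, recent hour kept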
*Note*: I guess we should consider that the storage system will presumably have high volume batch inserts every hour or so, and queries that will be a lot more frequent but also a lot lighter in data size.
And that is that. *So please, feel free to suggest storage technologies, comment, etc!* And if there is any assumption I made in which you do not agree, please comment also!
I will start the thread with 2 suggestions:
1) *PostgreSQL*: Seems to be able to handle the volume of the data and knows how to implement diminishing resolution for timeseries.
2) *Project Voldemort*: As we are denormalizing the cube entirely for anonymity, the db doesn't need to compute aggregations, so it may well be a key-value store.
Cheers!
Marcel
As always, I'd recommend that we go with tech we are familiar with -- MySQL or Cassandra. We have a Cassandra committer on staff who would be able to answer these questions in detail.
-Toby
On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
[ ... ]
(+ Eric)
On Mon, Jun 8, 2015 at 5:42 PM, Toby Negrin tnegrin@wikimedia.org wrote:
As always, I'd recommend that we go with tech we are familiar with -- MySQL or Cassandra. We have a Cassandra committer on staff who would be able to answer these questions in detail.
-Toby
On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
[ ... ]
On Mon, Jun 8, 2015 at 7:44 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
(+ Eric)
On Mon, Jun 8, 2015 at 5:42 PM, Toby Negrin tnegrin@wikimedia.org wrote:
As always, I'd recommend that we go with tech we are familiar with -- MySQL or Cassandra. We have a Cassandra committer on staff who would be able to answer these questions in detail.
I guess that'd be me; happy to help if I can!
On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
[ ... ]
There are 2 blocks of data to be stored:
Cube that represents the number of pageviews broken down by the following dimensions:
day/hour (size: 24)
project (size: 800)
agent type (size: 2)
To test with an initial level of anonymity, all cube cells whose value is less than k=100 have an undefined value. However, to be able to retrieve aggregated values without losing those undefined counts, all combinations of slices and dices are precomputed before anonymization and belong to the cube, too. Like this:
dim1, dim2, dim3, ..., dimN, val
a,    null, null, ..., null, 15     // pv for dim1=a
a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M records per year, I think.
Could you expound on this a bit? Is it just the 3 dimensions above (day, project, type), or something more? Also, how will this be queried? Do we need to query by dimensions arbitrarily, or will the "higher" dimensions always be qualified with matches on the lower ones, as in the example above (dim1=a; dim1=a & dim2=x; dim1=a & dim2=x & ... & dimN=true)?
[ ... ]
As always, I'd recommend that we go with tech we are familiar with -- MySQL or Cassandra. We have a Cassandra committer on staff who would be able to answer these questions in detail.
WMF uses Postgres for some things, no? Or is that just in labs?
On Jun 8, 2015, at 17:42, Toby Negrin tnegrin@wikimedia.org wrote:
[ ... ]
As always, I'd recommend that we go with tech we are familiar with -- MySQL or Cassandra. We have a Cassandra committer on staff who would be able to answer these questions in detail.
WMF uses Postgres for some things, no? Or is that just in labs?
Since this data is meant to be fully public and queryable in any way, we could put it in the PostgreSQL instance on labs. We should check with labs folks, perhaps horse trade some hardware, but I think that would be a splendid solution.
However, and I'm trying to understate this in case people are not familiar with my hyperbolic style, I'd rather drink Bud Lite Lime than use MySQL for this. MySQL is suited for a lot of things, but analytics is not one of them.
p.s. I will never drink Bud Lite Lime. Like, never.
Remember that (as things currently stand) putting the thing on labs means that meta-analytics ("how are the cubes being used?") will be a pain in the backside to integrate with our existing storage solutions.
On 8 June 2015 at 22:52, Dan Andreescu dandreescu@wikimedia.org wrote:
[ ... ]
I too would love to understand if RESTBase can become our default solution for these kinds of data-intensive APIs. Can you guys briefly explain what kind of queries and aggregations would be problematic if we were to go with Cassandra?
On Jun 9, 2015, at 8:39 AM, Oliver Keyes okeyes@wikimedia.org wrote:
[ ... ]
Does one of these options support both SQL-based and REST interfaces, with TSV and JSON output, with little user setup? I'm thinking of your typical Ubuntu/Mac/Windows user with the ability to import TSV data into a spreadsheet.
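For what it's worth, the output-format half of that could live in a thin service in front of whichever store we pick; a hypothetical sketch (Flask, made-up names and data):

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

# stand-in for the real store: article -> [(hour, views), ...]
STORE = {"en.wikipedia/Main_page": [("2015-01-01 17", 123456)]}

@app.route("/pageviews/<path:article>")
def pageviews(article):
    rows = STORE.get(article, [])
    if request.args.get("format") == "tsv":  # spreadsheet-friendly output
        body = "hour\tviews\n" + "\n".join("%s\t%d" % r for r in rows)
        return Response(body, mimetype="text/tab-separated-values")
    return jsonify([{"hour": h, "views": v} for h, v in rows])  # JSON default

if __name__ == "__main__":
    app.run()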
On Tuesday, June 9, 2015, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
[ ... ]
I think Eric's original response got lost, so let me include it below:
dim1, dim2, dim3, ..., dimN, val
a,    null, null, ..., null, 15     // pv for dim1=a
a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M records per year, I think.
Could you expound on this a bit? Is it just the 3 dimensions above (day, project, type), or something more? Also, how will this be queried? Do we need to query by dimensions arbitrarily, or will the "higher" dimensions always be qualified with matches on the lower ones, as in the example above (dim1=a; dim1=a & dim2=x; dim1=a & dim2=x & ... & dimN=true)?
Out of the box, Cassandra is fairly limited in the kind of indexing it provides. Its main data structure is a distributed hash table, with the ability to select a single hierarchical range below a fixed key. This is why Eric asked about whether your query patterns are hierarchical.
There is some very limited support for secondary indexes, but those work very differently from relational databases (only equality & no efficient global queries), so are only useful in special cases. In RESTBase we have built an abstraction that lets us maintain more useful secondary indexes in regular Cassandra tables. However, this implementation still lacks features like dynamic index creation and authentication, so is not anywhere close to the functionality provided by a relational database.
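For illustration, a minimal sketch of that "single hierarchical range below a fixed key" pattern with the Python driver (keyspace, table, and column names are assumptions, not a settled schema):

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS pv WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute(
    "CREATE TABLE IF NOT EXISTS pv.by_project ("
    "  project text, agent_type text, hour timestamp, views bigint,"
    "  PRIMARY KEY ((project, agent_type), hour))")

# Efficient: the partition key is fully specified and the range is on the
# clustering column -- exactly the hierarchical pattern described above.
rows = session.execute(
    "SELECT hour, views FROM pv.by_project WHERE project = %s "
    "AND agent_type = %s AND hour >= %s AND hour < %s",
    ("en.wikipedia", "user", datetime(2015, 1, 1), datetime(2015, 1, 2)))

# By contrast, "all projects for one agent_type" would hit every partition
# and need ALLOW FILTERING or a second, differently-keyed table.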
In any case, I think it makes sense to isolate analytics backends from production content storage. Trouble in the page view backend should not affect content storage end points. We can still expose and document both in a single public API at /api/rest_v1/ using RESTBase, but are free to use any backend service or storage. The backend service could be built using the service template https://github.com/wikimedia/service-template-node and some $DB, a RESTBase instance or any other technology if it makes sense, as long as it exposes a REST-ish API that is reasonably easy to use and proxy.
Gabriel
On Tue, Jun 9, 2015 at 7:10 AM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
[ ... ]
Eric, I think we should allow arbitrary querying on any dimension for that first data block. We could pre-aggregate all of those combinations pretty easily since the dimensions have very low cardinality. For the article-level data, no, we'd want just basic timeseries querying.
Thanks Gabriel, if you could point us to an example of these secondary RESTBase indices, that'd be interesting.
As for relational databases not being able to handle this, I remain optimistic. Toby, we're planning on diminishing the time resolution the way RRD files do, so growth over time should not be an issue. So the only question is: can we insert hourly data in bulk without killing the db? We'll have to run tests, but assuming each hour is around 1GB, PostgreSQL should be able to insert this in a few minutes without too many problems (using COPY or pg_bulkload).
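A rough sketch of that bulk path with psycopg2 (table and column names are placeholders, not a settled schema):

import io
import psycopg2

# one hour of per-article rows, tab-separated for COPY's text format
rows = [("en.wikipedia/Main_page", "2015-01-01 17:00", 123456),
        ("en.wiktionary/Bazinga", "2015-01-02 13:00", 23456)]
buf = io.StringIO("".join("%s\t%s\t%d\n" % r for r in rows))

conn = psycopg2.connect("dbname=pageviews")  # placeholder DSN
with conn, conn.cursor() as cur:
    # COPY streams the whole batch in one round trip, which is why it is
    # much faster than row-by-row INSERTs for hourly bulk loads
    cur.copy_expert("COPY pageviews_hourly (article, ts, views) FROM STDIN", buf)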
Finding good alternatives to investigate is important though, the more tailored to the problem the better. So I'd like to work some more on this list:
* Cassandra
* PostgreSQL
* Voldemort
(lol, otto :))
On Tue, Jun 9, 2015 at 12:59 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
[ ... ]
On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Eric, I think we should allow arbitrary querying on any dimension for that first data block. We could pre-aggregate all of those combinations pretty easily since the dimensions have very low cardinality.
Are you thinking about something like /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more dimensions?
For the article-level data, no, we'd want just basic timeseries querying.
Thanks Gabriel, if you could point us to an example of these secondary RESTBase indices, that'd be interesting.
The API used to define these tables is described in https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md, and the algorithm used to keep those indexes up to date is described in https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/Se... and largely implemented in https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/se... .
On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Eric, I think we should allow arbitrary querying on any dimension for that first data block. We could pre-aggregate all of those combinations pretty easily since the dimensions have very low cardinality.
Are you thinking about something like /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more dimensions?
Only one more right now, called "agent_type". But this is just the first "cube": we're planning a geo cube with more dimensions, and we'll probably try to release data split up by access method (mobile, desktop, etc.) and other dimensions as people need them. This will be tricky as we try to protect privacy, but that aside, the number of dimensions per endpoint right now seems to hover around 4 or 5.
For the article-level data, no, we'd want just basic timeseries querying.
Thanks Gabriel, if you could point us to an example of these secondary RESTBase indices, that'd be interesting.
The API used to define these tables is described in https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md, and the algorithm used to keep those indexes up to date is described in https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/Se... and largely implemented in https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/se... .
very cool, thx.
If we are going to completely denormalize the data sets for anonymization, and we expect just slice and dice queries to the database, I think we wouldn't take much advantage of a relational DB: it wouldn't need to aggregate, slice, or dice, because all slices and dices would be precomputed, right?
It seems to me that the nature of these denormalized/anonymized data sets is more like a key-value store. That's why I suggested Voldemort at first (which reportedly has slightly faster reads than Cassandra), but I see the preference for Cassandra, as it is a known tool inside WMF. So, +1 for Cassandra!
However, if we foresee the need of adding more data sets to the same DB, or querying them in a different way, a key-value store would be a limitation.
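As an illustration of that key-value shape (the key encoding here is hypothetical):

# Every precomputed slice/dice is one key, so any query is a single get
# and the store never aggregates; "all" marks a rolled-up dimension.
cube = {
    "2015-01-01T17|all|all": 160,
    "2015-01-01T17|en.wikipedia|all": 120,
    "2015-01-01T17|en.wiktionary|all": None,  # suppressed: count < k
}

def lookup(hour, project="all", agent_type="all"):
    return cube.get("%s|%s|%s" % (hour, project, agent_type))

print(lookup("2015-01-01T17"))                  # 160
print(lookup("2015-01-01T17", "en.wikipedia"))  # 120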
On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
[ ... ]
This thread seems to have paused for 1 or 2 days now.
So summarizing, the following storage technologies have been mentioned:
- PostgreSQL
- MySQL
- Cassandra
- Voldemort
And the following concerns have been raised; we want something that:
- We're already familiar with
- Permits meta-analytics
- Is queryable for JSON/TSV output with little user setup
- Withstands high-throughput bulk inserts
- Is queryable for slice and dice, even if we need to precompute those
It seems that there aren't many candidates and that the discussion focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each type, say PostgreSQL and Cassandra?
Or, anyone with more thoughts or suggestions?
On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
[ ... ]
I think we could add Impala to the storage technologies to assess. It allows reading/computing straight from HDFS and should be fast enough for a reasonable user experience. Maybe?
On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
[ ... ]
I think we could add Impala in storage technologies to assess.
I think we don’t want to build the pageview API on top of the Analytics Cluster.
On Jun 12, 2015, at 05:37, Joseph Allemandou jallemandou@wikimedia.org wrote:
[ ... ]
As someone who has run production serving systems on top of Hadoop, I think this is risky. We've had substantial planned and unplanned downtime on the cluster (which is to be expected) and it would be bad for a pageview API to be impacted.
-Toby
On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto aotto@wikimedia.org wrote:
[ ... ]
Andrew, Toby, that makes perfect sense. I had been thinking that Impala's distributed nature would handle high-availability issues, but I very much understand that having a front-end system rely on the analytics cluster is not as good as having a dedicated storage solution. Thanks for the good point :)
Joseph
On Fri, Jun 12, 2015 at 9:58 PM, Toby Negrin tnegrin@wikimedia.org wrote:
[ ... ]
OK, so I think we have our candidates:
1) PostgreSQL
2) Cassandra
We can speak about this at our next tasking meeting. If someone has more suggestions or comments, we still have a couple of days until then.
Thank you all!
Marcel
On Sat, Jun 13, 2015 at 11:37 AM, Joseph Allemandou < jallemandou@wikimedia.org> wrote:
Andrew, Toby, that makes perfect sense. While thinking that the distributed aspect of Impala would handle high availability issues, I very much understand that having a front-end system relying on the analytics cluster is not as good as having a dedicated storage solution. Thanks for the good point :) Joseph
On Fri, Jun 12, 2015 at 9:58 PM, Toby Negrin tnegrin@wikimedia.org wrote:
As someone who has run production serving systems on top of Hadoop, I think this is risky. We've had substantial planned and unplanned downtime on the cluster (which is to be expected) and it would be bad for a pageview API to be impacted.
-Toby
On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto aotto@wikimedia.org wrote:
I think we could add Impala to the storage technologies to assess.
I think we don’t want to build the pageview API on top of the Analytics Cluster.
On Jun 12, 2015, at 05:37, Joseph Allemandou jallemandou@wikimedia.org wrote:
I think we could add Impala to the storage technologies to assess. It allows reading / computing straight from HDFS and should be fast enough for a decent user experience. Maybe?
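For context, reading straight from HDFS through Impala could look roughly like this; the host and the pageview_hourly table are hypothetical placeholders, and the impyla client is just one possible way in:

# Sketch of querying pageview data straight from HDFS via Impala.
# Host, port, and the pageview_hourly table are hypothetical placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example", port=21050)
cursor = conn.cursor()
cursor.execute("""
    SELECT project, SUM(view_count) AS views
    FROM pageview_hourly
    WHERE year = 2015 AND month = 6 AND day = 12
    GROUP BY project
    ORDER BY views DESC
    LIMIT 10
""")
for project, views in cursor.fetchall():
    print(project, views)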
On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns < mforns@wikimedia.org> wrote:
This thread seems to have paused for 1 or 2 days now.
So summarizing, the following storage technologies have been mentioned:
- PostgreSQL
- MySQL
- Cassandra
- Voldemort
And the following concerns have been raised; we should choose something that:
- We're already familiar with
- Permits meta-analytics
- Is queryable for json/tsv with little user setup
- Withstands high-throughput bulk inserts
- Is queryable for slice and dice, even if we need to precompute those
It seems that there aren't many candidates and that the discussion focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each type, say PostgreSQL and Cassandra?
Or does anyone have more thoughts or suggestions?
On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns < mforns@wikimedia.org> wrote:
If we are going to completely denormalize the data sets for anonymization, and we expect just slice and dice queries to the database, I think we wouldn't take much advantage of a relational DB: it wouldn't need to aggregate values or slice and dice, since all slices and dices would be precomputed, right?
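A minimal sketch of that precomputation, assuming hypothetical dimension names and the k=100 threshold described at the top of the thread:

# Sketch of precomputing all slices/dices before anonymization, then
# suppressing small counts (k-anonymity threshold from the thread: k=100).
# Dimension names and the input format are hypothetical.
from itertools import combinations

K = 100

def precompute(rows, dims):
    """rows: iterable of (dim_values_dict, count). Returns every
    slice/dice aggregate, with counts below K replaced by None."""
    cube = {}
    for values, count in rows:
        # Aggregate into every subset of the dimensions (None = unconstrained).
        for r in range(len(dims) + 1):
            for subset in combinations(dims, r):
                key = tuple(values[d] if d in subset else None for d in dims)
                cube[key] = cube.get(key, 0) + count
    # Anonymize after aggregating, so rollups still include the small cells.
    return {k: (v if v >= K else None) for k, v in cube.items()}

rows = [({"project": "en.wikipedia", "agent": "user", "hour": 17}, 80),
        ({"project": "en.wikipedia", "agent": "spider", "hour": 17}, 70)]
cube = precompute(rows, dims=("project", "agent", "hour"))
# Each leaf is < 100 -> None, but the hourly rollup (150) survives:
print(cube[("en.wikipedia", None, 17)])    # 150
print(cube[("en.wikipedia", "user", 17)])  # None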
Thanks, Gabriel – this is super-helpful.
Dan/Kevin: slightly OT, are we aware of any use case related to features that would be exposing PV data in production? I’ve seen mocks from the Discovery team with PV data embedded in articles or search interfaces and I’m not sure what their status is.
On Jun 9, 2015, at 6:59 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
I think Eric's original response got lost, so let me include it below:
dim1, dim2, dim3, ..., dimN, val
a,    null, null, ..., null, 15      // pv for dim1=a
a,    x,    null, ..., null, 34      // pv for dim1=a & dim2=x
a,    x,    1,    ..., null, 27      // pv for dim1=a & dim2=x & dim3=1
a,    x,    1,    ..., true, undef   // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M records per year, I think.
Could you expound on this a bit? Is it just the 3 dimensions above (day, project, type), or something more? Also, how will this be queried? Do we need to query by dimensions arbitrarily, or will the "higher" dimensions always be qualified with matches on the lower ones, as in the example above (dim1=a; dim1=a & dim2=x; dim1=a & dim2=x & ... & dimN=true)?
Out of the box, Cassandra is fairly limited in the kind of indexing it provides. Its main data structure is a distributed hash table, with the ability to select a single hierarchical range below a fixed key. This is why Eric asked about whether your query patterns are hierarchical.
There is some very limited support for secondary indexes, but those work very differently from relational databases (only equality matches, and no efficient global queries), so they are only useful in special cases. In RESTBase we have built an abstraction that lets us maintain more useful secondary indexes in regular Cassandra tables. However, this implementation still lacks features like dynamic index creation and authentication, so it is not anywhere close to the functionality provided by a relational database.
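To make the "single hierarchical range below a fixed key" constraint concrete, here is a sketch of what a per-article timeseries table and its efficient vs. inefficient queries might look like; this is an illustrative schema, not the RESTBase table abstraction:

# Sketch of Cassandra's native access pattern: a fixed partition key plus a
# clustering column gives you exactly one hierarchical range scan.
# Illustrative schema only; not the RESTBase table abstraction.
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["cassandra.example"]).connect("pageviews")
session.execute("""
    CREATE TABLE IF NOT EXISTS per_article (
        article   text,       -- partition key: where the row lives
        ts        timestamp,  -- clustering column: sorted within partition
        views     int,
        PRIMARY KEY (article, ts)
    )
""")
# Efficient: equality on the partition key, range on the clustering column.
rows = session.execute(
    "SELECT ts, views FROM per_article "
    "WHERE article = %s AND ts >= %s AND ts < %s",
    ("en.wikipedia/Main_page", datetime(2015, 1, 1), datetime(2015, 2, 1)))
# Not efficient without extra tables/indexes: "all articles with views > N",
# or filtering by ts alone across partitions -- the limitation described
# above.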
In any case, I think it makes sense to isolate analytics backends from production content storage. Trouble in the page view backend should not affect content storage end points. We can still expose and document both in a single public API at /api/rest_v1/ using RESTBase, but are free to use any backend service or storage. The backend service could be built using the service template https://github.com/wikimedia/service-template-node and some $DB, a RESTBase instance or any other technology if it makes sense, as long as it exposes a REST-ish API that is reasonably easy to use and proxy.
Gabriel
On Tue, Jun 9, 2015 at 7:10 AM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
I too would love to understand if RESTBase can become our default solution for this kind of data-intensive APIs. Can you guys briefly explain what kind of queries and aggregations would be problematic if we were to go with Cassandra?
On Jun 9, 2015, at 8:39 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Remember that (as things currently stand) putting the thing on labs means meta-analytics ("how are the cubes being used?") being a pain in the backside to integrate with our existing storage solutions.
On 8 June 2015 at 22:52, Dan Andreescu dandreescu@wikimedia.org wrote:
As always, I'd recommend that we go with tech we are familiar with -- mysql or cassandra. We have a cassandra committer on staff who would be able to answer these questions in detail.
WMF uses Postgres for some things, no? Or is that just in labs?
Since this data is meant to be fully public and queryable in any way, we could put it in the PostgreSQL instance on labs. We should check with labs folks, perhaps horse trade some hardware, but I think that would be a splendid solution.
However, and I'm trying to understate this in case people are not familiar with my hyperbolic style, I'd rather drink Bud Lite Lime than use MySQL for this. MySQL is suited for a lot of things, but analytics is not one of them.
p.s. I will never drink Bud Lite Lime. Like, never.
Dan/Kevin: slightly OT, are we aware of any use case related to features that would be exposing PV data in production? I’ve seen mocks from the Discovery team with PV data embedded in articles or search interfaces and I’m not sure what their status is.
We have asks from people to expose some specific pageview data so they can include it in their features. This API should handle that, but we'll have to figure out the details. As far as use cases go, I feel like those are important but secondary to the "please get stats.grok.se working" request that we've had for much, much longer.
p.s. I will never drink Bud Lite Lime. Like, never.
You have the wrong attitude. Pretend it is beer-soda, not beer. Beer + sprite is yummy!
(I’ll let someone else figure out how this advice also applies to the database analogy.)
@otto: I believe that the high-throughput bulk insert use case will be difficult for a relational db (e.g. postgres) to handle. It will be interesting to see how well cassandra can handle the queries that people want to run. Tradeoffs...
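A rough sketch of the usual mitigation on the Postgres side, streaming each hourly batch through COPY instead of row-by-row INSERTs; the table name, columns, and connection details are hypothetical, and the table is assumed to already exist:

# Sketch of bulk-loading an hourly batch into Postgres via COPY rather than
# row-by-row INSERTs. Table name, columns, and connection details are
# hypothetical; the pageview_cube table is assumed to already exist.
import io
import psycopg2

conn = psycopg2.connect("dbname=pageviews")
batch = io.StringIO("en.wikipedia\tuser\t2015-06-09 17:00\t123456\n"
                    "en.wiktionary\tspider\t2015-06-09 17:00\t2345\n")
with conn, conn.cursor() as cur:  # the connection context commits on success
    cur.copy_from(batch, "pageview_cube",
                  columns=("project", "agent_type", "ts", "views"))
# COPY keeps up with large hourly batches far better than INSERT loops,
# though it is still sustained write load that the query side competes with.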
@dario: RESTBase and Cassandra are definitely different things; we should take care to not conflate them.
-Toby