OK, so I think we have our candidates: 1) PostgreSQL 2) Cassandra
We can discuss this at our next tasking meeting. If anyone has more suggestions or comments, we still have a couple of days until then.
Thank you all!
Marcel
On Sat, Jun 13, 2015 at 11:37 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Andrew, Toby, that makes perfect sense. While I thought the distributed aspect of Impala would handle high-availability issues, I very much understand that having a front-end system rely on the analytics cluster is not as good as having a dedicated storage solution. Thanks for the good point :) Joseph
On Fri, Jun 12, 2015 at 9:58 PM, Toby Negrin tnegrin@wikimedia.org wrote:
As someone who has run production serving systems on top of Hadoop, I think this is risky. We've had substantial planned and unplanned downtime on the cluster (which is to be expected) and it would be bad for a pageview API to be impacted.
-Toby
On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto aotto@wikimedia.org wrote:
> I think we could add Impala to the storage technologies to assess.
I think we don’t want to build the pageview API on top of the Analytics Cluster.
On Jun 12, 2015, at 05:37, Joseph Allemandou jallemandou@wikimedia.org wrote:
I think we could add Impala to the storage technologies to assess. It allows reading/computing straight from HDFS and should be fast enough for a reasonably good user experience. Maybe?
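For concreteness, a rough sketch of what a query through Impala could look like from Python, using the impyla DB-API client. The host, port, and the table/column names (pageview_hourly, view_count) are made up for illustration, not a worked-out proposal:

    # Rough sketch, not a proposal: query a hypothetical pre-aggregated
    # pageview table in HDFS through Impala, using the impyla DB-API client.
    from impala.dbapi import connect

    conn = connect(host='impala-coordinator.example.org', port=21050)
    cur = conn.cursor()
    cur.execute("""
        SELECT project, SUM(view_count) AS views
        FROM pageview_hourly            -- hypothetical table name
        WHERE year = 2015 AND month = 6 AND day = 12
        GROUP BY project
        ORDER BY views DESC
        LIMIT 10
    """)
    for project, views in cur.fetchall():
        print(project, views)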
On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
This thread seems to have paused for 1 or 2 days now.
So summarizing, the following storage technologies have been mentioned:
- PostgreSQL
- MySQL
- Cassandra
- Voldemort
And the following requirements have been raised; we should use something that:
- We're already familiar with
- Permits meta-analytics
- Is queryable for json/tsv with little user setup
- Withstands high-throughput bulk inserts
- Is queryable for slice and dice, even if we need to precompute those
It seems that there aren't many candidates and that the discussion focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each type, say PostgreSQL and Cassandra?
Or, anyone with more thoughts or suggestions?
On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
If we are going to completely denormalize the data sets for anonymization, and we expect only slice-and-dice queries against the database, I think we wouldn't take much advantage of a relational DB: it wouldn't need to aggregate values or slice and dice at query time, since all slices and dices would be precomputed, right?
It seems to me that the nature of these denormalized/anonymized data sets is more like a key-value store. That's why I suggested Voldemort at first (which, they say, has slightly faster reads than Cassandra), but I see the preference for Cassandra since it is a known tool inside WMF. So, +1 for Cassandra!
However, if we foresee the need to add more data sets to the same DB, or to query them in a different way, a key-value store would be a limitation.
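To make the key-value idea above concrete, a minimal, store-agnostic sketch of what "everything precomputed, reads are pure lookups" would mean. A plain Python dict stands in for Cassandra or Voldemort, and all names and numbers are made up:

    # Store-agnostic sketch: a plain dict stands in for Cassandra/Voldemort.
    # Every (project, agent, day, hour) combination -- including "all"
    # rollups -- maps to one precomputed count, so a read is a pure lookup
    # and no aggregation happens at query time. Numbers are invented.
    precomputed = {
        ("en.wikipedia", "user", "2015-06-12", "14"): 1234567,
        ("en.wikipedia", "all",  "2015-06-12", "14"): 1400000,
        ("all",          "all",  "2015-06-12", "all"): 650000000,
    }

    def pageviews(project="all", agent="all", day="all", hour="all"):
        # The slice either exists (precomputed) or it doesn't.
        return precomputed.get((project, agent, day, hour))

    print(pageviews("en.wikipedia", "user", "2015-06-12", "14"))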
On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
>
>> Eric, I think we should allow arbitrary querying on any dimension for that first data block. We could pre-aggregate all of those combinations pretty easily since the dimensions have very low cardinality.
>
> Are you thinking about something like /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more dimensions?
Only one more right now, called "agent_type". But this is just the first "cube": we're planning a geo cube with more dimensions, and we'll probably try to release data split up by access method (mobile, desktop, etc.) and other dimensions as people need them. That will be tricky as we try to protect privacy, but that aside, the number of dimensions per endpoint right now seems to hover around 4 or 5.
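As a back-of-the-envelope illustration of why low cardinality makes pre-aggregating every combination cheap, a small Python sketch; the dimension names and value lists are placeholders, not the real enumeration used by the pipeline:

    # Back-of-the-envelope: with 4-5 low-cardinality dimensions, enumerating
    # every cell of the cube (including "all" rollups) stays cheap. The
    # dimension values below are placeholders.
    from itertools import product

    dimensions = {
        "project":       ["en.wikipedia", "de.wikipedia", "all"],
        "access_method": ["desktop", "mobile-web", "mobile-app", "all"],
        "agent_type":    ["user", "spider", "all"],
    }

    cells_per_hour = 1
    for values in dimensions.values():
        cells_per_hour *= len(values)
    print(cells_per_hour)  # 3 * 4 * 3 = 36 precomputed rows per hour

    for combo in product(*dimensions.values()):
        pass  # each combo becomes one precomputed row in the store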
>> For the article-level data, no, we'd want just basic timeseries querying.
>>
>> Thanks Gabriel, if you could point us to an example of these secondary RESTBase indices, that'd be interesting.
>
> The API used to define these tables is described in https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md, the algorithm used to keep those indexes up to date is described in https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/Se... and it is largely implemented in https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/se... .
very cool, thx.
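For anyone skimming: the linked docs describe the actual RESTBase mechanism. As a very rough, hypothetical illustration of the general "denormalized secondary index table" pattern they build on (this is not RESTBase code, and the schema is invented), it amounts to something like:

    # Generic pattern only, NOT RESTBase code: keep a denormalized index
    # table in sync with the data table on every write, so reads by the
    # secondary attribute never scan the primary table. Both "tables" are
    # plain dicts and the schema is made up.
    data_table = {}        # primary key (article) -> row
    index_by_project = {}  # secondary key (project) -> set of primary keys

    def put(article, project, views):
        data_table[article] = {"project": project, "views": views}
        index_by_project.setdefault(project, set()).add(article)

    def get_by_project(project):
        return [data_table[a] for a in index_by_project.get(project, set())]

    put("Main_Page", "en.wikipedia", 42)
    print(get_by_project("en.wikipedia"))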