This thread seems to have paused for 1 or 2 days now.
So summarizing, the following storage technologies have been mentioned:
- PostgreSQL
- MySQL
- Cassandra
- Voldemort
The following requirements have also been raised. We want something that:
- We're already familiar with
- Permits meta-analytics
- Is queryable for JSON/TSV with little user setup
- Withstands high throughput bulk inserts
- Is queryable for slice and dice, even if we need to precompute those
It seems that there aren't many candidates and that the discussion focused
on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each
type, say PostgreSQL and Cassandra?
Or does anyone have more thoughts or suggestions?
On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
wrote:
If we are going to completely denormalize the data sets for anonymization,
and we expect only slice-and-dice queries against the database, I think we
wouldn't gain much from a relational DB: it wouldn't need to aggregate
values, slice or dice at query time, because all slices and dices would be
precomputed, right?
It seems to me that the nature of these denormalized/anonymized data sets
is more like a key-value store. That's why I suggested Voldemort at first
(which, they say, has slightly faster reads than Cassandra), but I see the
preference for Cassandra, since it is a known tool inside WMF.
So, +1 for Cassandra!
However, if we foresee the need to add more data sets to the same DB,
or to query them in a different way, a key-value store would be a limitation.
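
To illustrate the trade-off, here is a minimal sketch with a plain Python
dict standing in for the key-value store. The key layout
(project/agent/day), the sample numbers and the function names are all
assumptions made up for this example, not the schema of any of the stores
mentioned in this thread.

```python
# Hypothetical precomputed slices, keyed by "project/agent/day".
# The values are invented view counts, used only for illustration.
store = {
    "en.wikipedia/user/2015-06-09": 120,
    "en.wikipedia/all/2015-06-09": 150,
    "all/all/2015-06-09": 200,
}


def get_slice(project, agent, day):
    """Point lookup: one key, one precomputed value, no read-time aggregation."""
    return store.get(f"{project}/{agent}/{day}")


def views_for_agent_across_projects(agent, day):
    """A query that was NOT precomputed: it has to scan and parse every key."""
    total = 0
    for key, views in store.items():
        project, key_agent, key_day = key.split("/")
        # Skip the "all" project rollups so nothing is counted twice.
        if project != "all" and key_agent == agent and key_day == day:
            total += views
    return total


print(get_slice("all", "all", "2015-06-09"))
print(views_for_agent_across_projects("user", "2015-06-09"))
```

The first function is the access pattern a key-value store is good at. The
second shows the limitation: any slice we did not anticipate means a full
scan, or another precomputation pass.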
On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <gwicke(a)wikimedia.org>
wrote:
On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
> Eric, I think we should allow arbitrary querying on any dimension for
> that first data block. We could pre-aggregate all of those combinations
> pretty easily since the dimensions have very low cardinality.
>
Are you thinking about something like
/{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
dimensions?
only one more right now, called "agent_type". But this is just the
first "cube" and we're planning a geo cube with more dimensions and are
probably going to try and release data split up by access method (mobile,
desktop, etc.) and other dimensions as people need them. This will be
tricky as we try to protect privacy but that aside, the number of
dimensions per endpoint, right now, seems to hover around 4 or 5.
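
As a rough sketch of what pre-aggregating all of those combinations could
look like: every low-cardinality dimension contributes either its concrete
value or an "all" rollup, mirroring the /{project|all}/{agent|all}/{day}/{hour}
layout above. The row fields, sample values and the preaggregate() helper are
assumptions for illustration only, and the sketch assumes no real dimension
value is literally "all".

```python
from collections import defaultdict
from itertools import product


def preaggregate(rows, dimensions):
    """Sum views for every (value-or-'all') combination of the dimensions.

    Keys look like (project, agent, day, hour). With d dimensions, each
    input row contributes to 2**d keys.
    """
    totals = defaultdict(int)
    for row in rows:
        # Each dimension is either kept as-is or rolled up to "all".
        choices = [(row[dim], "all") for dim in dimensions]
        for combo in product(*choices):
            key = combo + (row["day"], row["hour"])
            totals[key] += row["views"]
    return dict(totals)


# Invented sample rows, used only to exercise the function.
rows = [
    {"project": "en.wikipedia", "agent": "user", "day": "2015-06-09", "hour": 11, "views": 120},
    {"project": "en.wikipedia", "agent": "spider", "day": "2015-06-09", "hour": 11, "views": 30},
    {"project": "de.wikipedia", "agent": "user", "day": "2015-06-09", "hour": 11, "views": 50},
]

cube = preaggregate(rows, ["project", "agent"])
print(cube[("all", "all", "2015-06-09", 11)])
print(cube[("en.wikipedia", "all", "2015-06-09", 11)])
```

With 4 or 5 dimensions this is 16 to 32 keys per input row, which stays
manageable only while the cardinality of each dimension is low.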
> For the article-level data, no, we'd want just basic timeseries
> querying.
>
> Thanks Gabriel, if you could point us to an example of these secondary
> RESTBase indices, that'd be interesting.
>
The API used to define these tables is described in
https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
and the algorithm used to keep those indexes up to date is described in
https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/S…
and largely implemented in
https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/s…
.
very cool, thx.
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics