On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:
On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
Eric, I think we should allow arbitrary querying on any dimension for that
first data block. We could pre-aggregate all of those combinations
pretty easily since the dimensions have very low cardinality.
Are you thinking about something like
/{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
dimensions?
Only one more right now, called "agent_type". But this is just the first
"cube" and we're planning a geo cube with more dimensions and are probably
going to try and release data split up by access method (mobile, desktop,
etc.) and other dimensions as people need them. This will be tricky as we
try to protect privacy but that aside, the number of dimensions per
endpoint, right now, seems to hover around 4 or 5.
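
To make the pre-aggregation idea above concrete, here is a minimal sketch of rolling up every combination of "specific value or all" across a few low-cardinality dimensions. It is only an illustration, not the actual pipeline: the function name pre_aggregate, the row field names, and the sample rows are all made up for the example, and only "project" and "agent_type" are taken from the discussion.

```python
from collections import defaultdict
from itertools import product

ALL = "all"  # rollup marker, mirroring the {project|all} style of path


def pre_aggregate(rows, dimensions):
    """Sum views for every (value-or-'all') combination of the given dimensions.

    Hypothetical helper: each row is a dict with the dimension fields plus
    'day', 'hour' and 'views'. Keys of the result are tuples of
    (dimension values..., day, hour).
    """
    totals = defaultdict(int)
    for row in rows:
        # Each row counts toward its own value and toward the "all" rollup
        # in every dimension, so it lands in 2 ** len(dimensions) buckets.
        choices = [(row[d], ALL) for d in dimensions]
        for combo in product(*choices):
            totals[combo + (row["day"], row["hour"])] += row["views"]
    return dict(totals)


# Invented sample data, purely for illustration.
rows = [
    {"project": "en.wikipedia", "agent_type": "user",
     "day": "2015-06-09", "hour": 17, "views": 10},
    {"project": "en.wikipedia", "agent_type": "spider",
     "day": "2015-06-09", "hour": 17, "views": 3},
    {"project": "de.wikipedia", "agent_type": "user",
     "day": "2015-06-09", "hour": 17, "views": 5},
]

cube = pre_aggregate(rows, ["project", "agent_type"])
```

With 4 or 5 dimensions each row fans out into 16 or 32 buckets, which stays cheap as long as the cardinality of each dimension is low.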
For the article-level data, no, we'd want just basic timeseries querying.
Thanks Gabriel, if you could point us to an example of these secondary
RESTBase indices, that'd be interesting.
The API used to define these tables is described in
https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
and the algorithm used to keep those indexes up to date is described in
https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/S…
and largely implemented in
https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/s…
.
very cool, thx.