Thanks, Gabriel – this is super-helpful. 

Dan/Kevin: slightly OT, are we aware of any use case related to features that would be exposing PV data in production? I’ve seen mocks from the Discovery team with PV data embedded in articles or search interfaces and I’m not sure what their status is.

On Jun 9, 2015, at 6:59 PM, Gabriel Wicke <gwicke@wikimedia.org> wrote:

I think Eric's original response got lost, so let me include it below:

>>> dim1,  dim2,  dim3,  ...,  dimN,  val
>>>    a,  null,  null,  ...,  null,   15    // pv for dim1=a
>>>    a,     x,  null,  ...,  null,   34    // pv for dim1=a & dim2=x
>>>    a,     x,     1,  ...,  null,   27    // pv for dim1=a & dim2=x &
>>> dim3=1
>>>    a,     x,     1,  ...,  true,  undef  // pv for dim1=a & dim2=x & ...
>>> & dimN=true
>>>
>>> So the size of this dataset would be something between 100M and 200M
>>> records per year, I think.

> Could you expound on this a bit?  Is it just the 3 dimensions above
> (day, project, type), or something more? Also, how will this be
> queried?  Do we need to query by dimensions arbitrarily, or will the
> "higher" dimensions always be qualified with matches on the lower
> ones, as in the example above ( dim1=a, dim1=a & dim2=x, pv for dim1=a
> & dim2=x & dimN=true)?

Out of the box, Cassandra is fairly limited in the kind of indexing it provides. Its main data structure is a distributed hash table, with the ability to select a single hierarchical range below a fixed key. This is why Eric asked about whether your query patterns are hierarchical.

There is some very limited support for secondary indexes, but those work very differently from relational databases (only equality & no efficient global queries), so are only useful in special cases. In RESTBase we have built an abstraction that lets us maintain more useful secondary indexes in regular Cassandra tables. However, this implementation still lacks features like dynamic index creation and authentication, so is not anywhere close to the functionality provided by a relational database.

In any case, I think it makes sense to isolate analytics backends from production content storage. Trouble in the page view backend should not affect content storage end points. We can still expose and document both in a single public API at /api/rest_v1/ using RESTBase, but are free to use any backend service or storage. The backend service could be built using the service template and some $DB, a RESTBase instance or any other technology if it makes sense, as long as it exposes a REST-ish API that is reasonably easy to use and proxy.

Gabriel

On Tue, Jun 9, 2015 at 7:10 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I too would love to understand if RestBASE can become our default solution for this kind of data-intensive APIs. Can you guys briefly explain what kind of queries and aggregations would be problematic if we were to go with Cassandra?

> On Jun 9, 2015, at 8:39 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:
>
> Remember that (as things currently stand) putting the thing on labs
> means meta-analytics ("how are the cubes being used?") being a pain in
> the backside to integrate with our existing storage solutions.
>
> On 8 June 2015 at 22:52, Dan Andreescu <dandreescu@wikimedia.org> wrote:
>>> As always, I'd recommend that we go with tech we are familiar with --
>>> mysql or cassandra. We have a cassandra committer on staff who would be able
>>> to answer these questions in detail.
>>>
>>>
>>> WMF uses PostGRES for some things, no?  Or is that is just in labs?
>>
>>
>> Since this data is meant to be fully public and queryable in any way, we
>> could put it in the PostgreSQL instance on labs.  We should check with labs
>> folks, perhaps horse trade some hardware, but I think that would be a
>> splendid solution.
>>
>> However, and I'm trying to understate this in case people are not familiar
>> with my hyperbolic style, I'd rather drink Bud Lite Lime than use MySQL for
>> this.  MySQL is suited for a lot of things, but analytics is not one of
>> them.
>>
>> p.s. I will never drink Bud Lite Lime.  Like, never.
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics



--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics