Re: [Analytics] [Technical] Pick storage for pageview cubes

9 Jun 2015


      Thanks, Gabriel – this is super-helpful.
Dan/Kevin: slightly OT, are we aware of any use case related to features that would be exposing PV data in production? I’ve seen mocks from the Discovery team with PV data embedded in articles or search interfaces and I’m not sure what their status is.
...
On Jun 9, 2015, at 6:59 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
I think Eric's original response got lost, so let me include it below:
...
...
...
dim1,  dim2,  dim3,  ...,  dimN,  val
   a,  null,  null,  ...,  null,   15    // pv for dim1=a
   a,     x,  null,  ...,  null,   34    // pv for dim1=a & dim2=x
   a,     x,     1,  ...,  null,   27    // pv for dim1=a & dim2=x &
dim3=1
   a,     x,     1,  ...,  true,  undef  // pv for dim1=a & dim2=x & ...
& dimN=true
So the size of this dataset would be something between 100M and 200M
records per year, I think.
...
Could you expound on this a bit?  Is it just the 3 dimensions above
(day, project, type), or something more? Also, how will this be
queried?  Do we need to query by dimensions arbitrarily, or will the
"higher" dimensions always be qualified with matches on the lower
ones, as in the example above ( dim1=a, dim1=a & dim2=x, pv for dim1=a
& dim2=x & dimN=true)?
Out of the box, Cassandra is fairly limited in the kind of indexing it provides. Its main data structure is a distributed hash table, with the ability to select a single hierarchical range below a fixed key. This is why Eric asked about whether your query patterns are hierarchical.
There is some very limited support for secondary indexes, but those work very differently from relational databases (only equality & no efficient global queries), so are only useful in special cases. In RESTBase we have built an abstraction that lets us maintain more useful secondary indexes in regular Cassandra tables. However, this implementation still lacks features like dynamic index creation and authentication, so is not anywhere close to the functionality provided by a relational database.
In any case, I think it makes sense to isolate analytics backends from production content storage. Trouble in the page view backend should not affect content storage end points. We can still expose and document both in a single public API at /api/rest_v1/ using RESTBase, but are free to use any backend service or storage. The backend service could be built using the service template https://github.com/wikimedia/service-template-node and some $DB, a RESTBase instance or any other technology if it makes sense, as long as it exposes a REST-ish API that is reasonably easy to use and proxy.
Gabriel
On Tue, Jun 9, 2015 at 7:10 AM, Dario Taraborelli <dtaraborelli@wikimedia.org mailto:dtaraborelli@wikimedia.org> wrote:
I too would love to understand if RestBASE can become our default solution for this kind of data-intensive APIs. Can you guys briefly explain what kind of queries and aggregations would be problematic if we were to go with Cassandra?
...
On Jun 9, 2015, at 8:39 AM, Oliver Keyes <okeyes@wikimedia.org mailto:okeyes@wikimedia.org> wrote:
Remember that (as things currently stand) putting the thing on labs
means meta-analytics ("how are the cubes being used?") being a pain in
the backside to integrate with our existing storage solutions.
On 8 June 2015 at 22:52, Dan Andreescu <dandreescu@wikimedia.org mailto:dandreescu@wikimedia.org> wrote:
...
...
As always, I'd recommend that we go with tech we are familiar with --
mysql or cassandra. We have a cassandra committer on staff who would be able
to answer these questions in detail.
WMF uses PostGRES for some things, no?  Or is that is just in labs?
Since this data is meant to be fully public and queryable in any way, we
could put it in the PostgreSQL instance on labs.  We should check with labs
folks, perhaps horse trade some hardware, but I think that would be a
splendid solution.
However, and I'm trying to understate this in case people are not familiar
with my hyperbolic style, I'd rather drink Bud Lite Lime than use MySQL for
this.  MySQL is suited for a lot of things, but analytics is not one of
them.
p.s. I will never drink Bud Lite Lime.  Like, never.

Analytics mailing list
Analytics@lists.wikimedia.org mailto:Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics https://lists.wikimedia.org/mailman/listinfo/analytics
--
Oliver Keyes
Research Analyst
Wikimedia Foundation

Analytics mailing list
Analytics@lists.wikimedia.org mailto:Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics https://lists.wikimedia.org/mailman/listinfo/analytics

Analytics mailing list
Analytics@lists.wikimedia.org mailto:Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics https://lists.wikimedia.org/mailman/listinfo/analytics
-- 
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Re: [Analytics] [Technical] Pick storage for pageview cubes