Eric, I think we should allow arbitrary querying on any dimension for that first data block. We could pre-aggregate all of those combinations pretty easily since the dimensions have very low cardinality. For the article-level data, no, we'd want just basic timeseries querying.
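Since the dimensions are low-cardinality, pre-aggregating every combination is cheap. A minimal sketch of that idea (a CUBE-style rollup; the dimension names and sample counts here are hypothetical, not from the actual dataset):

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical hourly records: dimension dict -> pageview count.
rows = [
    ({"project": "en", "type": "article", "agent": "user"}, 15),
    ({"project": "en", "type": "article", "agent": "bot"}, 4),
    ({"project": "de", "type": "article", "agent": "user"}, 7),
]

def cube(rows):
    """Pre-aggregate pageviews for every subset of dimensions."""
    totals = defaultdict(int)
    for dims, count in rows:
        keys = sorted(dims)
        for r in range(len(keys) + 1):
            for subset in combinations(keys, r):
                key = tuple((k, dims[k]) for k in subset)
                totals[key] += count
    return totals

agg = cube(rows)
print(agg[(("project", "en"),)])                    # 19
print(agg[(("agent", "user"), ("project", "en"))])  # 15
```

With N dimensions of cardinalities c1..cN, the pre-aggregated table has at most (c1+1) * ... * (cN+1) rows, which stays small when cardinalities are low.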
Thanks Gabriel, if you could point us to an example of these secondary RESTBase indices, that'd be interesting.
As for relational databases not being able to handle this, I remain optimistic. Toby, we're planning on diminishing the time resolution the way RRD files do, so growth over time should not be an issue. So the only question is whether we can insert hourly data in bulk without killing the DB. We'll have to run tests, but assuming each hour is around 1GB, PostgreSQL should be able to insert this in a few minutes without too many problems (using COPY or pg_bulkload).
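For the bulk-load path, a rough sketch of feeding COPY from Python (the table name and columns are made up for illustration; the psycopg2 call at the end is shown as a comment since it needs a live connection):

```python
import csv
import io

# Hypothetical hourly rows: (day, project, type, pageviews).
rows = [
    ("2015-06-09", "en.wikipedia", "article", 1500),
    ("2015-06-09", "de.wikipedia", "article", 700),
]

def to_copy_buffer(rows):
    """Serialize rows to an in-memory CSV buffer suitable for
    PostgreSQL's COPY ... FROM STDIN WITH (FORMAT csv)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

buf = to_copy_buffer(rows)
# With psycopg2, the load would then be something like:
#   cur.copy_expert("COPY pageviews FROM STDIN WITH (FORMAT csv)", buf)
```

COPY skips per-row statement overhead, which is what makes a ~1GB hourly batch feasible in minutes rather than hours.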
Finding good alternatives to investigate is important though, the more tailored to the problem the better. So I'd like to work some more on this list:
* Cassandra
* PostgreSQL
* Voldemort
(lol, otto :))
On Tue, Jun 9, 2015 at 12:59 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
I think Eric's original response got lost, so let me include it below:
dim1, dim2, dim3, ..., dimN, val
a,    null, null, ..., null, 15     // pv for dim1=a
a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M records per year, I think.
Could you expound on this a bit? Is it just the 3 dimensions above (day, project, type), or something more? Also, how will this be queried? Do we need to query by dimensions arbitrarily, or will the "higher" dimensions always be qualified with matches on the lower ones, as in the example above (dim1=a; dim1=a & dim2=x; dim1=a & dim2=x & ... & dimN=true)?
Out of the box, Cassandra is fairly limited in the kind of indexing it provides. Its main data structure is a distributed hash table, with the ability to select a single hierarchical range below a fixed key. This is why Eric asked about whether your query patterns are hierarchical.
There is some very limited support for secondary indexes, but those work very differently from relational databases (only equality & no efficient global queries), so are only useful in special cases. In RESTBase we have built an abstraction that lets us maintain more useful secondary indexes in regular Cassandra tables. However, this implementation still lacks features like dynamic index creation and authentication, so is not anywhere close to the functionality provided by a relational database.
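To make the "hierarchical range below a fixed key" point concrete, here is a toy model of Cassandra's layout in plain Python (the table contents are invented; this is an analogy, not Cassandra's actual implementation):

```python
from bisect import bisect_left, bisect_right

# Toy model: a hash table of partition keys, each holding rows kept
# sorted by clustering columns. Queries must fix the partition key;
# only then can they range-scan the clustering columns in order.
table = {
    ("en.wikipedia",): [  # partition key: project
        # clustering columns (day, hour) -> pageviews, sorted
        (("2015-06-08", 23), 900),
        (("2015-06-09", 0), 1000),
        (("2015-06-09", 1), 1100),
    ],
}

def range_query(partition, start, end):
    """Hierarchical range below a fixed key: the efficient case."""
    rows = table[partition]
    keys = [k for k, _ in rows]
    return rows[bisect_left(keys, start):bisect_right(keys, end)]

hits = range_query(("en.wikipedia",), ("2015-06-09", 0), ("2015-06-09", 23))
# A cross-partition query ("all projects on 2015-06-09") has no such
# index path and would need a full scan or a secondary index.
```

This is why non-hierarchical, arbitrary-dimension queries are awkward on Cassandra out of the box, while "drill down under a fixed prefix" is cheap.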
In any case, I think it makes sense to isolate analytics backends from production content storage. Trouble in the page view backend should not affect content storage end points. We can still expose and document both in a single public API at /api/rest_v1/ using RESTBase, but are free to use any backend service or storage. The backend service could be built using the service template https://github.com/wikimedia/service-template-node and some $DB, a RESTBase instance or any other technology if it makes sense, as long as it exposes a REST-ish API that is reasonably easy to use and proxy.
Gabriel
On Tue, Jun 9, 2015 at 7:10 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I too would love to understand if RESTBase can become our default solution for these kinds of data-intensive APIs. Can you guys briefly explain what kind of queries and aggregations would be problematic if we were to go with Cassandra?
On Jun 9, 2015, at 8:39 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Remember that (as things currently stand) putting the thing on labs means meta-analytics ("how are the cubes being used?") being a pain in the backside to integrate with our existing storage solutions.
On 8 June 2015 at 22:52, Dan Andreescu dandreescu@wikimedia.org wrote:
As always, I'd recommend that we go with tech we are familiar with -- MySQL or Cassandra. We have a Cassandra committer on staff who would be able to answer these questions in detail.
WMF uses Postgres for some things, no? Or is that just in labs?
Since this data is meant to be fully public and queryable in any way, we could put it in the PostgreSQL instance on labs. We should check with the labs folks, perhaps horse-trade some hardware, but I think that would be a splendid solution.
However, and I'm trying to understate this in case people are not familiar with my hyperbolic style, I'd rather drink Bud Lite Lime than use MySQL for this. MySQL is suited for a lot of things, but analytics is not one of them.
p.s. I will never drink Bud Lite Lime. Like, never.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- Oliver Keyes Research Analyst Wikimedia Foundation
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation