Dan – thanks for the thorough update, hope you don’t mind if I repost this to the
analytics list – I bet several people on this list are eager to know where this is going.
Dario
Begin forwarded message:
From: Milimetric <no-reply@phabricator.wikimedia.org>
Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data available in
semi-publicly queryable database format
Date: May 21, 2015 at 9:31:36 AM PDT
To: dario@wikimedia.org
Reply-To: T44259+public+a4a5010c21d15736@phabricator.wikimedia.org
Milimetric added a comment.
I'd love to start a more open discussion about our progress on this. Here's the
recent history and where we are:
February 2015: with data flowing into the Hadoop cluster, we defined which raw
webrequests were "page views". The research is here
<https://meta.wikimedia.org/wiki/Research:Page_view> and the code is here
<https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java>
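To give a rough idea of what that definition checks, here is an illustrative Hive query; the real logic lives in the Java class linked above, and the table and column names below are just a sketch, not our actual schema:

    -- Illustrative only: the kind of filtering the pageview definition
    -- applies to raw webrequests.
    SELECT uri_host, uri_path
    FROM webrequest
    WHERE http_status IN ('200', '304')      -- successfully served requests
      AND content_type LIKE 'text/html%'     -- page HTML, not images or CSS
      AND uri_path LIKE '/wiki/%'            -- article-style URLs
      AND agent_type = 'user';               -- exclude known spiders and bots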
March 2015: we used this page view definition to create a raw pageview table in Hadoop.
This is queryable via Hive, but at about 3 TB of data per day we don't have the
resources to expose it publicly.
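For context on why size is the blocker: Hive queries against this table look like ordinary SQL, but every query scans the partitions it touches. Assuming hypothetical table and partition names, a single hour might be queried like this:

    -- Hypothetical query: restricting to one hourly partition keeps the
    -- scan bounded, but a full day still means reading around 3 TB.
    SELECT uri_host, COUNT(*) AS views
    FROM pageview_raw
    WHERE year = 2015 AND month = 5 AND day = 20 AND hour = 14
    GROUP BY uri_host;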
April 2015: we started querying this data internally, but it overloaded our cluster and
queries were slow.
May 2015: we're working on an intermediate aggregation that would total up page
counts by hour over the dimensions we think most people care about. We estimate this
will cut the data size down by a factor of 50.
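As a sketch of what that rollup could look like in Hive (the table name and dimension columns here are hypothetical, chosen just to illustrate the shape):

    -- One row per page per hour per dimension combination, instead of
    -- one row per request; this is where the ~50x reduction comes from.
    INSERT OVERWRITE TABLE pageview_hourly
    PARTITION (year = 2015, month = 5, day = 20, hour = 14)
    SELECT project, page_title, access_method, agent_type,
           COUNT(*) AS view_count
    FROM pageview_raw
    WHERE year = 2015 AND month = 5 AND day = 20 AND hour = 14
    GROUP BY project, page_title, access_method, agent_type;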
Progress has been slow, mostly because Event Logging is our main priority and it's
been having serious scaling issues. We think we have a good handle on those issues
after our latest patch, and in a week or so we'll shift most of our focus to the
Pageview API.
Once this new intermediate aggregation is done, we'll hopefully free up some cluster
resources and be in a better position to stand up a public API. Right now, we are
evaluating two possible data pipelines:
Pipeline 1:
Put daily aggregates into PostgreSQL. We think per-article hourly data would be too big
for PostgreSQL.
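For illustration, a daily rollup in PostgreSQL might be shaped like this (the schema and names are hypothetical):

    -- One row per page per day keeps the row count small enough
    -- for a single PostgreSQL instance.
    CREATE TABLE pageview_daily (
        view_date   date   NOT NULL,
        project     text   NOT NULL,
        page_title  text   NOT NULL,
        view_count  bigint NOT NULL,
        PRIMARY KEY (view_date, project, page_title)
    );

    -- The API would then serve lookups like:
    SELECT view_date, view_count
    FROM pageview_daily
    WHERE project = 'en.wikipedia'
      AND page_title = 'Main_Page'
      AND view_date BETWEEN '2015-05-01' AND '2015-05-20';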
Pipeline 2:
Query the data from the Hive tables directly with Impala. Impala is only suited to
small-to-medium data sizes, but it's much faster than Hive. We might be able to query
the hourly data if we use this method.
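Since Impala reads the same tables registered in the Hive metastore, the query shape wouldn't change; only the engine would. Reusing the hypothetical hourly table from the sketch above:

    -- Same SQL as Hive would accept, but answered by Impala's in-memory
    -- engine, so interactive latencies become plausible.
    SELECT day, hour, SUM(view_count) AS views
    FROM pageview_hourly
    WHERE project = 'en.wikipedia'
      AND page_title = 'Main_Page'
      AND year = 2015 AND month = 5
    GROUP BY day, hour
    ORDER BY day, hour;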
Common Pipeline after we make the choice above:
- Mondrian builds OLAP cubes and handles caching, which is very useful with this much
data.
- Point RESTBase to Mondrian and expose the API publicly at restbase.wikimedia.org.
This will be a reliable public API that people can build tools around.
- Point Saiku to Mondrian and make a new public website for exploratory analytics.
Saiku is an open-source OLAP cube visualization and analysis tool.
Hope that helps. As we get closer to making this API real, we would love your input,
participation, questions, etc.
TASK DETAIL
https://phabricator.wikimedia.org/T44259
To: Milimetric
Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, scfc,
Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, Aubrey,
Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill