Dan, thanks for the thorough update. Hope you don't mind if I repost this to the analytics list; I bet several people here are eager to know where this is going.

Dario

Begin forwarded message:

From: Milimetric <no-reply@phabricator.wikimedia.org>
Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data available in semi-publicly queryable database format
Date: May 21, 2015 at 9:31:36 AM PDT
To: dario@wikimedia.org
Reply-To: T44259+public+a4a5010c21d15736@phabricator.wikimedia.org

Milimetric added a comment.

I'd love to start a more open discussion about our progress on this. Here's the recent history and where we are:

  • February 2015: with data flowing into the Hadoop cluster, we defined which raw webrequests count as "page views". The research is here and the code is here.
  • March 2015: we used this page view definition to create a raw pageview table in Hadoop. It's queryable with Hive, but at about 3 TB of data per day we don't have the resources to expose it publicly.
  • April 2015: we started querying this data internally, but it overloaded our cluster and queries were slow.
  • May 2015: we're working on an intermediate aggregation that totals page counts by hour over the dimensions we think most people care about. We estimate this will cut the size down by a factor of 50.
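To make that aggregation step concrete, here's a minimal Python sketch of the kind of hourly roll-up we have in mind. The fields (hour, project, page title, access method) are illustrative stand-ins, not the actual webrequest schema:

```python
from collections import defaultdict

# Sketch only: these dimension fields are illustrative, not the real schema.
raw_requests = [
    ("2015-05-01T00", "en.wikipedia", "Main_Page", "desktop"),
    ("2015-05-01T00", "en.wikipedia", "Main_Page", "desktop"),
    ("2015-05-01T00", "en.wikipedia", "Main_Page", "mobile"),
    ("2015-05-01T01", "de.wikipedia", "Wikipedia", "desktop"),
]

def aggregate_hourly(requests):
    """Roll individual page views up into hourly counts per dimension tuple."""
    counts = defaultdict(int)
    for hour, project, title, access in requests:
        counts[(hour, project, title, access)] += 1
    return dict(counts)

hourly = aggregate_hourly(raw_requests)
# Four raw rows collapse into three aggregate rows; at production scale,
# that collapse is what buys the size reduction.
```

In Hadoop this would be a Hive aggregation job rather than a Python loop, but the grouping logic is the same.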

Progress has been slow, mostly because Event Logging is our main priority and it has been having serious scaling issues. We think our latest patch gives us a good handle on those, and in a week or so we'll shift most of our focus to the Pageview API.

Once this new intermediate aggregation is done, we'll hopefully free up some cluster resources and be in a better position to stand up a public API. Right now, we are evaluating two possible data pipelines:

Pipeline 1:

  • Put daily aggregates into PostgreSQL. We think per-article hourly data would be too big for PostgreSQL.
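Under Pipeline 1, the daily aggregates would live in an ordinary relational table. Here's a rough sketch using in-memory SQLite in place of PostgreSQL; the table and column names are illustrative, not a decided schema:

```python
import sqlite3

# Sketch: SQLite stands in for PostgreSQL; schema names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_pageviews (
        view_date TEXT,
        project   TEXT,
        page      TEXT,
        views     INTEGER,
        PRIMARY KEY (view_date, project, page)
    )
""")
conn.execute(
    "INSERT INTO daily_pageviews VALUES (?, ?, ?, ?)",
    ("2015-05-01", "en.wikipedia", "Main_Page", 123456),
)
row = conn.execute(
    "SELECT views FROM daily_pageviews WHERE view_date = ? AND page = ?",
    ("2015-05-01", "Main_Page"),
).fetchone()
```

One row per (day, project, page) keeps the table small enough for a relational store, which is exactly why hourly granularity probably wouldn't fit here.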

Pipeline 2:

  • Query the Hive tables directly with Impala. Impala only handles small-to-medium data well, but it is much faster than Hive; with this method we might even be able to serve the hourly data.
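Under Pipeline 2, the API layer would run SQL directly against the Hive tables through Impala. A query for hourly counts might look something like the following; the table and column names are assumptions, not the real schema:

```python
# Illustrative Impala-style query over a hypothetical hourly Hive table.
HOURLY_QUERY = """
SELECT year, month, day, hour, project, page_title,
       SUM(view_count) AS views
FROM pageview_hourly
WHERE project = 'en.wikipedia'
  AND page_title = 'Main_Page'
GROUP BY year, month, day, hour, project, page_title
"""
```

The appeal of this route is that there's no second copy of the data to load and keep in sync; the cost is that query latency depends on cluster load.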

Common Pipeline after we make the choice above:

  • Mondrian builds OLAP cubes and handles caching, which is very useful with this much data.
  • Point RESTBase at Mondrian and expose the API publicly at restbase.wikimedia.org. This will be a reliable public API that people can build tools around.
  • Point Saiku at Mondrian and launch a new public website for exploratory analytics. Saiku is an open-source OLAP cube visualization and analysis tool.
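Once RESTBase fronts all of this, a tool author would just build URLs against the public endpoint. Here's a hypothetical client sketch; the endpoint path and date format are assumptions, since the API's shape hasn't been decided yet:

```python
from urllib.parse import quote

# Hypothetical: the path segments and date format below are assumptions.
BASE = "https://restbase.wikimedia.org"

def pageviews_url(project, article, start, end):
    """Build a request URL for one article's daily view counts."""
    return f"{BASE}/metrics/pageviews/{project}/{quote(article)}/{start}/{end}"

url = pageviews_url("en.wikipedia", "Main Page", "20150501", "20150531")
```

Whatever the final paths look like, the point is that a stable, documented URL scheme is what lets people build tools on top without caring about Hadoop, Impala, or Mondrian underneath.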

Hope that helps. As we get closer to making this API real, we would love your input, participation, questions, etc.

To: Milimetric
Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill