Dan – thanks for the thorough update, hope you don’t mind if I repost this to the analytics list – I bet several people on this list are eager to know where this is going.
Dario
Begin forwarded message:
From: Milimetric no-reply@phabricator.wikimedia.org
Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data available in semi-publicly queryable database format
Date: May 21, 2015 at 9:31:36 AM PDT
To: dario@wikimedia.org
Reply-To: T44259+public+a4a5010c21d15736@phabricator.wikimedia.org
Milimetric added a comment.
I'd love to start a more open discussion about our progress on this. Here's the recent history and where we are:
- February 2015: with data flowing into the Hadoop cluster, we defined which raw webrequests were "page views". The research is here https://meta.wikimedia.org/wiki/Research:Page_view and the code is here https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
- March 2015: we used this page view definition to create a raw pageview table in Hadoop. This is queryable by Hive, but it's about 3 TB of data per day, so we don't have the resources to expose it publicly.
- April 2015: we queried this data internally, but it overloaded our cluster and queries were slow.
- May 2015: we're working on an intermediate aggregation that would total up page counts by hour over the dimensions that we think most people care about. We estimate this will cut down size by a factor of 50.
Progress has been slow mostly because Event Logging is our main priority and it's been having serious scaling issues. We think we have a good handle on the Event Logging issues after our latest patch, and in a week or so we're going to mostly focus on the Pageview API.
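As a rough illustration of the intermediate aggregation described in the May 2015 item, the sketch below totals raw page view records by hour over a few dimensions. The record fields (`project`, `page_title`, `access_method`) and the function name `aggregate_hourly` are assumptions made for this example, not the actual table schema or job. The real aggregation runs on the Hadoop cluster, not in a single Python process.

```python
from collections import Counter
from datetime import datetime


def aggregate_hourly(records):
    """Count page views per (hour, project, page_title, access_method).

    `records` is an iterable of dicts with hypothetical keys:
    'timestamp' (ISO 8601 string), 'project', 'page_title', 'access_method'.
    """
    counts = Counter()
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        key = (
            hour.isoformat(),
            rec["project"],
            rec["page_title"],
            rec["access_method"],
        )
        counts[key] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    {"timestamp": "2015-05-21T09:05:12", "project": "en.wikipedia",
     "page_title": "Hadoop", "access_method": "desktop"},
    {"timestamp": "2015-05-21T09:48:00", "project": "en.wikipedia",
     "page_title": "Hadoop", "access_method": "desktop"},
    {"timestamp": "2015-05-21T10:01:00", "project": "en.wikipedia",
     "page_title": "Hadoop", "access_method": "desktop"},
]

hourly = aggregate_hourly(sample)
for key, views in sorted(hourly.items()):
    print(key, views)
```

Taking the figures above at face value, a 50x reduction of roughly 3 TB per day would leave on the order of 60 GB per day.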
Once this new intermediate aggregation is done, we'll hopefully free up some cluster resources and be in a better position to load up a public API. Right now, we are evaluating two possible data pipelines:
Pipeline 1:
- Put daily aggregates into PostgreSQL. We think per-article hourly data would be too big for PostgreSQL.
Pipeline 2:
- Query data from the Hive tables directly with Impala. Impala works best on small to medium data, but it is much faster than Hive. We might be able to query the hourly data if we use this method.
Common Pipeline after we make the choice above:
- Mondrian builds OLAP cubes and handles caching, which is very useful with this much data.
- Point RESTBase to Mondrian and expose the API publicly at restbase.wikimedia.org. This will be a reliable public API that people can build tools around.
- Point Saiku to Mondrian and make a new public website for exploratory analytics. Saiku is an open source OLAP cube visualization and analysis tool.
Hope that helps. As we get closer to making this API real, we would love your input, participation, questions, etc.
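To make the Pipeline 1 trade-off concrete, here is a minimal sketch of rolling hourly per-article counts up into daily aggregates, which is the granularity that option would load into PostgreSQL. The key layout `(hour, project, page_title)` and the name `roll_up_daily` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def roll_up_daily(hourly_counts):
    """Sum hourly counts into daily totals per (date, project, page_title).

    `hourly_counts` maps (hour_iso, project, page_title) -> views, where
    hour_iso looks like '2015-05-21T09:00:00' (a hypothetical layout).
    """
    daily = defaultdict(int)
    for (hour_iso, project, page_title), views in hourly_counts.items():
        day = hour_iso[:10]  # keep only the 'YYYY-MM-DD' part
        daily[(day, project, page_title)] += views
    return dict(daily)


# Made-up hourly counts, for illustration only.
hourly_example = {
    ("2015-05-21T09:00:00", "en.wikipedia", "Hadoop"): 2,
    ("2015-05-21T10:00:00", "en.wikipedia", "Hadoop"): 1,
    ("2015-05-22T00:00:00", "en.wikipedia", "Hadoop"): 5,
}

print(roll_up_daily(hourly_example))
```

A daily table holds at most one row per article per day, where an hourly table can hold up to 24, which is one way to see why the hourly data is the harder fit.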
TASK DETAIL https://phabricator.wikimedia.org/T44259
EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: Milimetric
Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill
Thanks Dario, I should've thought to do the same. As I say in my comment, I'd love to get a discussion going here. This project has been in the dark and postponed for too long, and now that we're focusing on it everyone deserves our direct thoughts on it. Everyone here also has the right to directly influence our thoughts and plans. So please, don't be shy :)
On Thu, May 21, 2015 at 12:36 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
Dan – thanks for the thorough update, hope you don’t mind if I repost this to the analytics list – I bet several people on this list are eager to know where this is going.
Dario
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Why is the work on Domas's data, which we know is incredibly unreliable? Because people still rely on it?
On 21 May 2015 at 12:39, Dan Andreescu dandreescu@wikimedia.org wrote:
Thanks Dario, I should've thought to do the same. As I say in my comment, I'd love to get a discussion going here. This project has been in the dark and postponed for too long, and now that we're focusing on it everyone deserves our direct thoughts on it. Everyone here also has the right to directly influence our thoughts and plans. So please, don't be shy :)
the title of the Phab ticket is obsolete, the plan is not to work off the existing hourly PV dumps, per Dan’s note.
On May 21, 2015, at 9:50 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Why is the work on Domas's data, which we know is incredibly unreliable? Because people still rely on it?
Sweet :)
On 21 May 2015 at 12:56, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
the title of the Phab ticket is obsolete, the plan is not to work off the existing hourly PV dumps, per Dan’s note.
-- Oliver Keyes Research Analyst Wikimedia Foundation