On Sat, Mar 25, 2017 at 2:29 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:

The Analytics Cluster refers to the whole infrastructure: the raw streams of data, the processing of that data, and all the software that goes into configuring, monitoring, and maintaining it. From our users' point of view, the cluster is the big machine that lets them compute. The Data Lake is just the data, so it is the serving layer of that machinery. In a Venn diagram, the Data Lake would sit inside the Analytics Cluster. So we could stop talking about the cluster publicly and refer only to the Data Lake, since that's becoming the interface that people have to our data. But we'll still talk about the Analytics Cluster internally.

Hope that clarifies. None of this is written in stone, and we're always open to any ideas people have on making this easier to understand.

From: Neil Patel Quinn
Sent: Friday, March 24, 2017 17:20
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: [Analytics] Data Lake documentation on Wikitech

Hey Analytics!

I'm working on updating the Wikitech Analytics documentation based on my new understanding of the Data Lake. I've already clarified that there's no separate thing called the "Data Warehouse" (other than some experiments from 2015), but I still don't understand the difference between the Analytics Cluster and the Data Lake.

From what I learned yesterday, the Data Lake is everything stored in the Hadoop cluster (including pageview, mediacounts, last-access, and edit history data), even when it can't be usefully joined together.

But that seems to be the same thing as the Analytics Cluster ("the Hadoop cluster and its related components"). Is it possible to pick one name ("Data Lake" or "Analytics Cluster") and stick with it? I promise you it'll make the whole system much easier to understand for outsiders :)

--
Neil Patel Quinn, product analyst
Wikimedia Foundation

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics