Ja, +1 to what Dan said. Data Lake is our term for the ability to serve queries on large refined datasets to users. Analytics Cluster refers to (almost) all of our infrastructure. Hadoop will likely always power part of the Data Lake, but in an ideal world, we’d have some other lower-latency query system / service that would abstract away some of the trickiness of Hadoop, e.g. Druid (and Pivot).



On Sat, Mar 25, 2017 at 2:29 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
The Analytics Cluster refers more to the whole infrastructure, including the raw streams of data, the processing of that data, and all the software that goes into configuring, monitoring, and maintaining it. From our users' point of view, they can think of the cluster as the big machine that lets them compute. The Data Lake is just the data, so it's the serving layer of that big machinery. In a Venn diagram, the Data Lake would be inside the Analytics Cluster. So we could stop talking about the cluster publicly and just refer to the Data Lake, since that's becoming the interface that people have to our data. But we'll still talk about the Analytics Cluster internally.

Hope that clarifies. It's not like this is written in stone; we're always open to any ideas people have on making this easier to understand.

From: Neil Patel Quinn
Sent: Friday, March 24, 2017 17:20
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: [Analytics] Data Lake documentation on Wikitech

Hey Analytics!

I'm working on updating the Wikitech Analytics documentation based on my new understanding of the Data Lake. I've already clarified that there's no separate thing called the "Data Warehouse" (other than some experiments from 2015), but I still don't understand the difference between the Analytics Cluster and the Data Lake.

From what I learned yesterday, the Data Lake is everything stored in the Hadoop cluster (including pageview, mediacounts, last-access, and edit history data), even when it can't be usefully joined together.

But that seems to be the same thing as the Analytics Cluster ("the Hadoop cluster and its related components"). Is it possible to pick one name ("Data Lake" or "Analytics Cluster") and stick with it? I promise you it'll make the whole system much easier to understand for outsiders :)

--
Neil Patel Quinn, product analyst
Wikimedia Foundation



_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics