Yes, +1 to what Dan said. Data Lake is our term for the ability to serve
queries on large refined datasets to users. Analytics Cluster refers to
(almost) all of our infrastructure. Hadoop will likely always power part
of Data Lake, but in an ideal world, we’d have some other lower latency
query system / service that would abstract away some of the trickiness of
Hadoop. E.g. Druid (and Pivot).
On Sat, Mar 25, 2017 at 2:29 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
The Analytics cluster refers more to the whole infrastructure:
the raw streams of data, the processing of that data, and all the software
that goes into configuring, monitoring, and maintaining it. From our users'
point of view, they can think of the cluster as the big machine that lets
them compute. The Data Lake is just the data, so the serving layer of that
big machinery. In a Venn diagram the data lake would be inside the
Analytics cluster. So we could stop talking about the cluster publicly and
just refer to the data lake, since that's becoming the interface that
people have to our data. But we'll still talk about the Analytics cluster
Hope that clarifies, and it's not like this is written in stone, we're
always open to any ideas people have on making this easier to understand.
*From: *Neil Patel Quinn
*Sent: *Friday, March 24, 2017 17:20
*To: *A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
*Reply To: *A mailing list for the Analytics Team at WMF and everybody
who has an interest in Wikipedia and analytics.
*Subject: *[Analytics] Data Lake documentation on Wikitech
I'm working on updating the Wikitech Analytics documentation
<https://wikitech.wikimedia.org/wiki/Analytics> based on my new
understanding of the Data Lake. I've already clarified that there's no
separate thing called the "Data Warehouse" (other than some experiments
from 2015), but I still don't understand the difference between the Analytics
Cluster <https://wikitech.wikimedia.org/wiki/Analytics/Cluster> and the Data
Lake.
From what I learned yesterday, the Data Lake is everything stored in the
Hadoop cluster (including pageview, mediacounts, last-access, and edit
history data), even when it can't be usefully joined together.
But that seems to be the same thing as the Analytics Cluster ("the Hadoop
cluster and its related components"). Is it possible to pick one name
("Data Lake" or "Analytics Cluster") and stick with it? I promise you it
would make the whole system much easier to understand for outsiders :)
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>