Re: [Analytics] [Ops] Datastax or Cloudera for Analytics Batch Processing

25 Sep 2012

I don't think NameNode frailty is that big of an issue.  Hadoop with HDFS
has been used successfully for years at scales that dwarf what we are
considering, from its inception at Yahoo across 40k nodes to its use at
Facebook today.  There are multiple approaches to NameNode redundancy.  It
can effectively be an ops issue involving 2 or 3 servers, using old, tested
tools for making applications that were not designed for high availability
highly available.  But Cloudera's out-of-the-box solution here looks
reasonable at first glance, and there are other alternatives such as
Facebook's open-source NameNode replacement.

With Facebook recently blogging  "Whether your system has hundreds of nodes
or thousands, HDFS is the most scalable and most reliable open-source
distributed filesystem available." (
https://www.facebook.com/notes/facebook-engineering/under-the-hood-hadoop-d…)
I'm inclined to give their opinion greater credibility than any marketing
copy, in the absence of our own data.

Take the NameNode controversy out of the picture, and what is the pitch for
DSE?

What data processing patterns are better handled by tools built on top of
Hadoop than by native Cassandra queries?  Are such tools being considered
only because they are better supported and easier to use by analysts, or
are they technically superior in some ways to a pure Cassandra solution?
Does Cassandra's on-disk storage format always map well to the usage
patterns of tools like Hive?

Wondering generically about the use of Cassandra as a distributed
file-system replacement, would you consider it as an object store for media
files?  It seems like it isn't well suited for such a task, while HDFS is a
potential option for replacing Swift and can be used for video streaming.
This may seem irrelevant to Kraken, and yet the fact that HDFS can stream
large objects while Cassandra cannot may impact the performance of data
analysis tools designed for HDFS.  There's also the (perhaps small)
possibility that we will in fact replace Swift with HDFS.  If that occurs
and keeping an HDFS cluster available becomes a primary duty of the ops
team, they should be able to take responsibility for a Kraken HDFS cluster
as well.

-Asher

On Tue, Sep 25, 2012 at 7:37 AM, Andrew Otto <otto(a)wikimedia.org> wrote:

...
 As discussed in the TechOps meeting yesterday, the Analytics team is
 evaluating two different Hadoop distributions for the batch processing
 layer of the Kraken cluster:  Cloudera Hadoop 4 (CDH4) and Datastax
 Enterprise (DSE).  I'm going to try to describe from a very high level why
 we think DSE is purrrrrty cool, and why CDH4 sounds like a relative
 headache.  I'll then ask the questions we hope to answer soon.

 (Quick disclaimer:  This is a totally biased and incomplete summary of DSE
 vs CDH4.  I am highlighting weaknesses of CDH4 in order to illustrate why
 we are considering DSE.  If we are allowed to use DSE, then we will have
 more work to do weighing pros and cons of both solutions.)

 CDH4 is just a well packaged distribution of Hadoop and some of the most
 useful Hadoop tools.  The core of Hadoop is MapReduce and the Hadoop
 Distributed Filesystem (HDFS).  HDFS allows huge files to be written across
 any number of machines.  MapReduce jobs can then be written to do parallel
 processing of these files.

 Hadoop has a single NameNode that is responsible for managing file
 metadata in HDFS.  This is a notorious single point of failure.

 CDH4 attempts to solve this problem with the introduction of a Standby
 NameNode[1].  This requires that both the Active and Standby NameNodes
 share a filesystem (via NFS or something similar).  The Standby node uses
 this to synchronize edits to its own file metadata namespace.  If the
 Active NameNode goes down, the Standby can take over and *should* have the
 exact same metadata stored as the Active.  Administrators have to take
 special care to ensure that only one of the two NameNodes is active at
 once, or metadata corruption could result.  Read about Fencing[2] for more
 info.
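
To make the fencing concern above concrete, here is a toy Python model of two
NameNodes sharing one edit log.  It is only a sketch and not Hadoop code: the
class names, the in-memory log, and the epoch counter are all assumptions made
for illustration.  Real deployments fence the old Active node with configured
fencing methods rather than with an epoch check like this one.

```python
class SharedEditLog:
    """Toy stand-in for the shared edits directory (e.g. on NFS)."""

    def __init__(self):
        self.epoch = 0   # highest epoch that has claimed the log
        self.edits = []  # ordered metadata edits

    def claim(self):
        """A node becoming active bumps the epoch, fencing older writers."""
        self.epoch += 1
        return self.epoch

    def append(self, epoch, edit):
        """Accept an edit only from the most recent claimant."""
        if epoch != self.epoch:
            raise PermissionError(f"writer with epoch {epoch} is fenced")
        self.edits.append(edit)


class NameNode:
    """Hypothetical NameNode that keeps file metadata in a dict."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.epoch = None  # None means this node is a standby
        self.applied = 0   # number of edits already replayed locally
        self.namespace = {}

    def catch_up(self):
        """Replay edits from the shared log not yet applied locally."""
        for path, meta in self.log.edits[self.applied:]:
            self.namespace[path] = meta
        self.applied = len(self.log.edits)

    def become_active(self):
        self.catch_up()
        self.epoch = self.log.claim()

    def write(self, path, meta):
        self.log.append(self.epoch, (path, meta))
        self.catch_up()


log = SharedEditLog()
nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)

nn1.become_active()
nn1.write("/logs/a", {"blocks": 3})

nn2.become_active()  # failover: nn2 replays the log and takes over
nn2.write("/logs/b", {"blocks": 1})

try:
    nn1.write("/logs/c", {"blocks": 9})  # nn1 still believes it is active
    stale_write_rejected = False
except PermissionError:
    stale_write_rejected = True
```

Without the check in `append`, both nodes could write to the log at once, which
is the metadata corruption scenario described above.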

 Not only is the NameNode a SPOF, but it doesn't scale well when there are
 many files or many metadata operations.  More recent Hadoop releases solve
 this by providing 'HDFS Federation'[3].  Basically, this is just multiple
 independent NameNodes, each managing its own namespace, all sharing the
 same data nodes.  This is
 all more configuration to get Hadoop to scale, something it is supposed to
 be good at out of the box.
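
As a rough illustration of what Federation means for clients, the sketch below
routes a path to the NameNode that owns its part of the namespace using a
longest-prefix match against a mount table.  The mount table contents, the
NameNode names, and the function name are hypothetical; this is a toy model of
the idea, not Hadoop's implementation.

```python
# Hypothetical mount table: namespace prefix -> NameNode responsible for it.
MOUNT_TABLE = {
    "/user": "nn-user",
    "/logs": "nn-logs",
}


def resolve_namenode(path, mount_table):
    """Return the NameNode owning `path`, using the longest matching prefix."""
    best = None
    for prefix in mount_table:
        # Match the prefix itself or anything below it, but not "/logsx".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no NameNode is configured for {path}")
    return mount_table[best]


owner_of_log_file = resolve_namenode("/logs/2012/09/25.gz", MOUNT_TABLE)
owner_of_home_dir = resolve_namenode("/user/otto", MOUNT_TABLE)
```

Each entry in such a table is one more piece of configuration to maintain,
which is the overhead the paragraph above is pointing at.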

 All of the above also comes with lots of configuration to maintain and
 tweak to make work.  DSE is supposed to solve these problems and be much
 easier to work with.

 DSE is Hadoop without HDFS.  Instead, Datastax has written an HDFS
 emulation layer on top of Cassandra (CFS).  The huge benefit here is that
 from the client's viewpoint, everything works just like Hadoop[4].
 MapReduce jobs and all of the fancy Hadoop tools still work.  But there's
 no NameNode to deal with.  In Cassandra all nodes are peers.  This allows
 for a much more homogeneous cluster and less configuration[5].  Cassandra
 automatically rebalances its data when a node goes down.

 Ah but DSE has WMF related problems of its own!  Most importantly, DSE is
 not 100% open source*.  The core of DSE is open source components available
 under the Apache license (Hadoop, Cassandra, etc.).  As far as I can tell,
 anything that is not an Apache project is proprietary.  Datastax has a
 comparison[7] of its Community vs. Enterprise versions.  This includes
 packaging and wrapper CLI tools, but most importantly the Cassandra File
 System (CFS) emulation layer.  This piece is what makes DSE so attractive.

 Also note that DSE is not free.  It is free for development purposes, but
 it has a license that we'd have to buy if we want to run it in production.
 However, Diederik knows one of the founders of Datastax who might give it
 to us at a discount, or perhaps for free.

 Okokok.  So I'm starting this thread in order to answer my big question:
  Are we allowed to use DSE even if it is not 100% open source?  If the
 answer is an easy 'no', then our choice is easy:  we will use CDH4.  Is it
 possible to get an answer to this question before we go down the road of
 dedicating time to evaluating and learning DSE?

 tl;dr:  Datastax is cooler than Cloudera, but Datastax is not 100% open
 source.  Can we still use it?

 - otto + the Analytics Team

 *much like Cloudera for Hadoop, DataStax publishes tested, stable
 distributions of Cassandra and ecosystem tools bundled together into one
 package. In addition to selling support, their Enterprise edition contains
 proprietary code that, among other things, provides HDFS emulation on top
 of Cassandra. This code originated as Brisk[7], and was forked when the
 developers founded DataStax.

 [1]
 https://ccp.cloudera.com/display/CDH4DOC/Introduction+to+Hadoop+High+Availa…
 [2]
 https://ccp.cloudera.com/display/FREE4DOC/Configuring+HDFS+High+Availability
 [3]
 http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/Federati…
 [4] http://www.datastax.com/faq#dse-2
 [5]
 http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
 [6] http://wiki.apache.org/cassandra/HadoopSupport
 [7] https://github.com/riptano/brisk ,
 https://github.com/riptano/brisk-hadoop-common

 _______________________________________________
 Ops mailing list
 Ops(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/ops
