[Analytics] Datastax or Cloudera for Analytics Batch Processing

25 Sep 2012

As discussed in the TechOps meeting yesterday, the Analytics team is evaluating two
different Hadoop distributions for the batch processing layer of the Kraken cluster: 
Cloudera Hadoop 4 (CDH4) and Datastax Enterprise (DSE).  I'm going to try to describe
from a very high level why we think DSE is purrrrrty cool, and why CDH4 sounds like a
relative headache.  I'll then ask the questions we hope to answer soon.

(Quick disclaimer:  This is a totally biased and incomplete summary of DSE vs CDH4.  I am
highlighting weaknesses of CDH4 in order to illustrate why we are considering DSE.  If we
are allowed to use DSE, then we will have more work to do weighing pros and cons of both
solutions.)

CDH4 is just a well packaged distribution of Hadoop and some of the most useful Hadoop
tools.  The core of Hadoop is MapReduce and the Hadoop Distributed Filesystem (HDFS). 
HDFS allows huge files to be written across any number of machines.  MapReduce jobs can
then be written to do parallel processing of these files.

Hadoop has a single NameNode that is responsible for managing file metadata in HDFS.  This
is a notorious single point of failure. 

CDH4 attempts to solve this problem with the introduction of a Standby NameNode[1].  This
requires that both the Active and Standby NameNodes share a filesystem (via NFS or
something similar).  The Standby nodes uses this to synchronize edits to its own file
metadata namespace.  If the Active NameNode goes down, the Standby can take over and
*should* have the exact same metadata stored as the Active.  Administrators have to take
special care to ensure that only one of the two NameNodes is active at once, or metadata
corruption could result.  Read about Fencing[2] for more info.

Not only is the NameNode a SPOF, but it doesn't scale well when there are many files
or many metadata operations.  More recent Hadoop releases solve this providing 'HDFS
Federation'[3].  Basically, this is just multiple NameNodes sharing the independent
spaces of the same data nodes.  This is all more configuration to get Hadoop to scale,
something it is supposed to be good at out of the box.

All of the above also comes with lots of configuration to maintain and tweak to make work.
 DSE is supposed to solve these problems and be much easier to work with.

DSE is Hadoop without HDFS.  Instead, Datastax has written an HDFS emulation layer on top
of Cassandra (CFS).  The huge benefit here is that from the clients viewpoint, everything
works just like Hadoop[4].  MapReduce jobs and all of the fancy Hadoop tools still work. 
But there's no NameNode to deal with.  In Cassandra all nodes are peers.  This allows
for a much more homogenous cluster and less configuration[5].  Cassandra automatically
rebalances its data when a node goes down.

Ah but DSE has WMF related problems of its own!  Most importantly, DSE is not 100% open
source*.  The core of DSE is open source components available under the Apache license
(Hadoop, Cassandra, etc.).  As far as I can tell, anything that is not an Apache project
is proprietary.  Datastax has a comparison[7] of its Community vs. Enterprise versions. 
This includes packaging and wrapper CLI tools, but most importantly the Cassandra File
System (CFS) emulation layer.  This piece is what makes DSE so attractive.  

Also note that DSE is not free.  It is for development purposes, but it has a license that
we'd have to buy if we want to run it in production.  However, Diederik knows one of
the founders of Datastax who might give it to us at a discount, or perhaps for free.

Okokok.  So I'm starting this thread in order to answer my big question:  Are we
allowed to use DSE even if it is not 100% open source?  If the answer is an easy
'no', then our choice is easy:  we will use CDH4.  Is it possible to get an answer
to this question before we go down the road of dedicating time to evaluating and learning
DSE?

tl;dr:  Datastax is cooler than Cloudera, but Datastax is not 100% open source.  Can we
still use it?

- otto + the Analytics Team

*much like Cloudera for Hadoop, DataStax publishes tested, stable distributions of
Cassandra and ecosystem tools bundled together into one package. In addition to selling
support, their Enterprise edition contains proprietary code that, among other things,
provides HDFS emulation on top Cassandra. This code originated as Brisk[7], and was forked
when the developers founded DataStax.

[1] https://ccp.cloudera.com/display/CDH4DOC/Introduction+to+Hadoop+High+Availa…
[2] https://ccp.cloudera.com/display/FREE4DOC/Configuring+HDFS+High+Availability
[3] http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/Federati…
[4] http://www.datastax.com/faq#dse-2
[5] http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
[6] http://wiki.apache.org/cassandra/HadoopSupport
[7] https://github.com/riptano/brisk , https://github.com/riptano/brisk-hadoop-common

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

[Analytics] Datastax or Cloudera for Analytics Batch Processing