Re: [Analytics] [Ops] Datastax or Cloudera for Analytics Batch Processing

26 Sep 2012

On Tuesday, September 25, 2012 at 3:25 PM, Asher Feldman wrote:

...
  What data processing patterns are better handled by
  tools built on top of Hadoop than by native Cassandra queries?

Range scans. Cassandra has two data partitioner classes: RandomPartitioner and
ByteOrderedPartitioner. The latter *does* let you partition data lexically by key
bytes, but if you go that route you have to think about load balancing explicitly, and
it's easy to make mistakes. You basically lose the cruise-control load balancing and
node management benefits of a DHT like Cassandra. For that reason, "DataStax
strongly recommends against using the ordered partitioner", citing the following
reasons:

- Sequential writes can cause hot spots
- More administrative overhead to load balance the cluster
- Uneven load balancing for multiple column families

More details here:
http://www.datastax.com/docs/1.0/cluster_architecture/partitioning#byteorde…
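
To make the trade-off concrete, here is a rough Python sketch. It is a toy model, not Cassandra's actual token arithmetic: the four-node cluster, the boundary keys, and both function names are made up for illustration. The only thing it borrows from the real system is that RandomPartitioner places rows by an MD5 hash of the key, while an ordered partitioner places them by the key bytes themselves.

```python
import bisect
import hashlib

NODES = 4  # hypothetical cluster size, chosen for illustration


def hashed_node(key: bytes, nodes: int = NODES) -> int:
    """Toy stand-in for hash partitioning: place a key by the MD5 of its bytes.

    Simplification: the 128-bit hash space is split into equal ranges,
    one per node. Real token assignment is more involved.
    """
    token = int.from_bytes(hashlib.md5(key).digest(), "big")
    return (token * nodes) >> 128


def ordered_node(key: bytes, boundaries: list) -> int:
    """Toy stand-in for byte-ordered partitioning.

    Node i owns every key that sorts below boundaries[i]; the last node
    owns everything from the final boundary upward.
    """
    return bisect.bisect_right(boundaries, key)


# Hypothetical split points an operator might pick by hand.
boundaries = [b"g", b"n", b"u"]

# Sixty lexically adjacent, timestamp-style keys (one "minute" each).
keys = [f"2012-09-25T15:{m:02d}".encode() for m in range(60)]

hashed = [hashed_node(k) for k in keys]
ordered = [ordered_node(k, boundaries) for k in keys]

print("nodes touched, hashed placement: ", sorted(set(hashed)))
print("nodes touched, ordered placement:", sorted(set(ordered)))
```

With ordered placement the whole key range sits on one node, so a range scan is a single contiguous read, but every sequential write lands on that same node (the hot spot from the list above). With hashed placement the load spreads out, but adjacent keys are scattered, so a range scan has to visit many nodes.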

HDFS is explicitly designed to help you reason about and manage data locality.
