Re: [Analytics] [Ops] Datastax or Cloudera for Analytics Batch Processing

26 Sep 2012


      On Tuesday, September 25, 2012 at 3:25 PM, Asher Feldman wrote:
...
What data processing patterns are better handled by tools built on top of Hadoop than by native Cassandra queries?
Range scans. Cassandra has two data partitioner classes: RandomPartitioner and ByteOrderedPartitioner. RandomPartitioner places rows by the MD5 hash of the key, so rows with adjacent keys end up scattered around the ring and you can't scan them in key order. ByteOrderedPartitioner *does* let you partition data lexically by key bytes, but if you go that route you have to think about load balancing explicitly, and it's easy to make mistakes. You basically lose the cruise-control load balancing / node management benefits of a DHT like Cassandra. For that reason, "DataStax strongly recommends against using the ordered partitioner", citing the following reasons:
- Sequential writes can cause hot spots
- More administrative overhead to load balance the cluster
- Uneven load balancing for multiple column families
More details here: http://www.datastax.com/docs/1.0/cluster_architecture/partitioning#byteorder...
HDFS is explicitly designed to help you reason about and manage data locality.
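
To make the partitioner trade-off above concrete, here is a rough Python sketch. It is not Cassandra code: the four-node cluster, the timestamp-style row keys, the split points, and the function names are all made up for illustration, and the hash placement is a simplification of the real token ring. It only shows how hashed placement scatters adjacent keys while byte-ordered placement keeps them together.

```python
import bisect
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size

# Hypothetical row keys: one per minute of a single hour, written in time order.
keys = [f"2012-09-25T15:{m:02d}".encode() for m in range(60)]


def random_node(key: bytes) -> int:
    """Hash placement: treat the MD5 digest as a 128-bit token and split the
    token space into NODES equal ranges (a simplification of the real ring)."""
    token = int.from_bytes(hashlib.md5(key).digest(), "big")
    return (token * NODES) >> 128


# Hypothetical ordered layout: node i owns keys below splits[i], and the
# last node owns everything from the final split point upward.
splits = [b"2012-04", b"2012-07", b"2012-10"]


def ordered_node(key: bytes) -> int:
    """Byte-order placement: the owner is found by comparing raw key bytes."""
    return bisect.bisect_right(splits, key)


random_counts = Counter(random_node(k) for k in keys)
ordered_counts = Counter(ordered_node(k) for k in keys)

print("hashed placement :", dict(sorted(random_counts.items())))
print("ordered placement:", dict(sorted(ordered_counts.items())))
```

With these made-up split points, every key for that hour lands on a single node under ordered placement. A range scan over the hour is one contiguous read, but every write in that hour hits the same node, which is the hot spot from the first bullet. Hashed placement spreads the writes, but a scan over the same hour has to touch several nodes.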
