On Tuesday, September 25, 2012 at 3:25 PM, Asher Feldman wrote:
What data processing patterns are better handled by
tools built on top of Hadoop than by native Cassandra queries?
Range scans. Cassandra has two data partitioner classes: RandomPartitioner and
ByteOrderedPartitioner. RandomPartitioner places each row by an MD5 hash of its key, so
keys that are adjacent lexically end up scattered across the ring and you can't scan a
key range in order. ByteOrderedPartitioner *does* let you partition data lexically by key
bytes, but by going that route you now have to think about load balancing explicitly, and
it's easy to make mistakes. You basically lose the cruise-control load balancing /
node management benefits of a DHT like Cassandra. For that reason, "DataStax
strongly recommends against using the ordered partitioner", citing the following
reasons:
- Sequential writes can cause hot spots
- More administrative overhead to load balance the cluster
- Uneven load balancing for multiple column families
More details here:
http://www.datastax.com/docs/1.0/cluster_architecture/partitioning#byteorde…
HDFS, by contrast, is explicitly designed to help you reason about and manage data locality.
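
To make the partitioner trade-off above concrete, here is a rough Python sketch. It is a
toy model, not Cassandra's actual code: the helper names (random_token, ring_owner,
ordered_owner), the 4-node ring, and the split keys are all made up for illustration.
The only thing taken from Cassandra is the general idea that RandomPartitioner derives
a token from an MD5 hash of the row key, while an ordered partitioner places rows by
the key bytes themselves.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

RING_SIZE = 2 ** 127  # assumed token space for this toy model


def random_token(key: bytes) -> int:
    """Hash-based placement: token derived from the MD5 digest of the key."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING_SIZE


def ring_owner(token: int, node_tokens: list) -> int:
    """Index of the first node whose token is >= the row token, wrapping around."""
    return bisect_left(node_tokens, token) % len(node_tokens)


def ordered_owner(key: bytes, split_keys: list) -> int:
    """Order-preserving placement: compare raw key bytes against split points."""
    return bisect_left(split_keys, key) % len(split_keys)


# Hypothetical 4-node cluster with evenly spaced tokens (hashed case)
# and hand-picked split keys (ordered case).
node_tokens = [i * RING_SIZE // 4 for i in range(4)]
split_keys = [b"d", b"k", b"r", b"z"]

# Sequentially generated keys, e.g. time-ordered event ids.
keys = [b"event:%04d" % i for i in range(1000)]

hashed = Counter(ring_owner(random_token(k), node_tokens) for k in keys)
ordered = Counter(ordered_owner(k, split_keys) for k in keys)

print("hashed placement :", dict(hashed))
print("ordered placement:", dict(ordered))
```

In this model the hashed placement spreads the sequential keys over the nodes, which is
why a range scan would have to touch every node, while the ordered placement keeps the
whole range on a single node: good for scanning, but exactly the "sequential writes can
cause hot spots" problem from the list above.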