On Tuesday, September 25, 2012 at 3:25 PM, Asher Feldman wrote:
What data processing patterns are better handled by tools built on top of Hadoop than by native Cassandra queries?
Range scans. Cassandra has two main data partitioner classes: RandomPartitioner and ByteOrderedPartitioner. The latter *does* let you partition data lexically by key bytes, but going that route means you have to think about load balancing explicitly, and it's easy to make mistakes. You basically lose the cruise-control load balancing / node management benefits of a DHT like Cassandra. For that reason, "DataStax strongly recommends against using the ordered partitioner", citing the following reasons:
- Sequential writes can cause hot spots
- More administrative overhead to load balance the cluster
- Uneven load balancing for multiple column families
More details here: http://www.datastax.com/docs/1.0/cluster_architecture/partitioning#byteorder...
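
To make the trade-off concrete, here is a rough Python sketch of the two placement strategies. It is a toy model, not Cassandra's actual code: the 4-node cluster, the split points, the `event-NNNNNN` keys, and the helper names `random_node` / `ordered_node` are all made up for illustration, and the MD5-to-node mapping is only a simplified stand-in for what RandomPartitioner does with tokens.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NODES = 4  # hypothetical cluster size

def random_node(key: bytes) -> int:
    """Hash placement: MD5 the key and map the 128-bit value onto equal ranges."""
    token = int.from_bytes(hashlib.md5(key).digest(), "big")
    return token * NODES >> 128  # token < 2**128, so the result is in [0, NODES)

# Ordered placement: each node owns a contiguous slice of the key space.
# These split points are arbitrary; choosing them well is the "explicit
# load balancing" work an ordered partitioner pushes onto the operator.
BOUNDARIES = [b"\x40", b"\x80", b"\xc0"]

def ordered_node(key: bytes) -> int:
    """Byte-ordered placement: find the slice the raw key bytes fall into."""
    return bisect_right(BOUNDARIES, key)

# Sequential keys, e.g. time-ordered event ids (hypothetical workload).
keys = [b"event-%06d" % i for i in range(1000)]

random_load = Counter(random_node(k) for k in keys)
ordered_load = Counter(ordered_node(k) for k in keys)

# A range scan over 100 consecutive keys: which nodes must be contacted?
scan = keys[100:200]
random_touch = {random_node(k) for k in scan}
ordered_touch = {ordered_node(k) for k in scan}

print("writes per node, hash placement:   ", dict(random_load))
print("writes per node, ordered placement:", dict(ordered_load))
print("nodes hit by range scan, hash:   ", sorted(random_touch))
print("nodes hit by range scan, ordered:", sorted(ordered_touch))
```

With ordered placement every one of these sequential keys lands on the same node (the hot spot from the list above), but a range scan only has to talk to that one node. With hash placement the writes spread out across the cluster, and the price is that consecutive keys are scattered, so a range scan has to fan out to multiple nodes.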
HDFS is explicitly designed to help you reason about and manage data locality.