On Sep 25, 2012, at 4:37 PM, Andrew Otto <otto@wikimedia.org> wrote:
As discussed in the TechOps meeting yesterday, the Analytics team is evaluating two different Hadoop distributions for the batch processing layer of the Kraken cluster: Cloudera Hadoop 4 (CDH4) and Datastax Enterprise (DSE). I'm going to try to describe from a very high level why we think DSE is purrrrrty cool, and why CDH4 sounds like a relative headache. I'll then ask the questions we hope to answer soon.
[snip]
Most of this has already been said, so I'll keep it short: using Free and Open Source Software is an important part of what we do. Much of the open source software running in our cluster also has a non-free "Enterprise" version, with more and sometimes arguably better features than the "open (source) core" version. We generally don't use those non-free versions, especially not when it's a conscious decision made as a team. Even purely on a technical level, I don't think we've ever regretted those decisions, whereas we're constantly trying to move away from the few exceptions that have slipped through the cracks in the past.
We only use non-free solutions where there is no realistic, reasonable open source alternative available at all. With Enterprise versions based on existing open source software, that's almost never true, and in this particular case, judging from your comments as well as others', it doesn't seem to apply either. It's perfectly possible to use a fully open source package (Cloudera Hadoop) for your purposes, and thus that's what you should do at Wikimedia.