On Sep 25, 2012, at 4:37 PM, Andrew Otto <otto(a)wikimedia.org> wrote:
> As discussed in the TechOps meeting yesterday, the
> Analytics team is evaluating two different Hadoop distributions for the batch processing
> layer of the Kraken cluster: Cloudera Hadoop 4 (CDH4) and Datastax Enterprise (DSE).
> I'm going to try to describe from a very high level why we think DSE is purrrrrty
> cool, and why CDH4 sounds like a relative headache. I'll then ask the questions we
> hope to answer soon.

Most of this has already been said, so I'll keep it short now: Using Free and Open
Source Software is an important part of what we do. There are many cases of open source
software in use in our cluster with non-free "Enterprise" versions out there,
with more and sometimes arguably better features than in the "open (source)
core" versions. We generally don't use those non-free versions, especially not
when it's a conscious decision made as a team. Even purely on a technical level, I
don't think we've ever regretted those decisions, whereas we're constantly
trying to move away from the few exceptions that have slipped through the cracks in the
past. We only use non-free solutions where there is no realistic, reasonable open source
alternative available at all. With Enterprise versions based on existing open source
software that's almost never the case, and in this particular case, judging from your
comments as well as others', it doesn't seem to apply either. It's perfectly
possible to use a fully open source package (Cloudera Hadoop) for your purposes, and thus
that's what you should do at Wikimedia.
Mark Bergsma <mark(a)wikimedia.org>
Lead Operations Architect