Hi all,
as previously announced (https://lists.wikimedia.org/pipermail/wikitech-l/2015-August/082853.html), we've been evaluating a "clustering solution" for use as an alternative to GridEngine for toollabs.
Our goal is to find a suitable, modern, stable tool to run not only toollabs webservices but also, in the longer term, our microservices in production: a clustered environment that would let us improve the availability of individual services and scale applications more easily, further reducing friction and direct ops involvement in the day-to-day setup and deployment of services.
Our evaluation of the available solutions is ongoing. We're mostly done filling in an "evaluation spreadsheet" (https://docs.google.com/spreadsheets/d/1YkVsd8Y5wBn9fvwVQmp9Sf8K9DZCqmyJ-ew-...), but we welcome and encourage further involvement and suggestions. The easiest place to provide them is the tracking ticket for the evaluation, https://phabricator.wikimedia.org/T106475
We have received some interesting feedback already, and we look forward to incorporating more!
We are considering two solutions: Mesosphere's Marathon (which is based on Mesos), https://mesosphere.github.io/marathon/, and Google's Kubernetes, https://kubernetes.io.
Here is a summary of our findings so far.

MESOS/MARATHON

Pros:
- Mesos is stable and battle tested, although Marathon is quite young and mostly used in Mesosphere's commercial offering
- Supports overcommitting resources (which is important in toollabs, probably less so in production)
- Has a nice, clean API and is fully distributed with no potential SPOFs
- Chronos is another framework that can run on Mesos and is a great distributed cron

Cons:
- The multitenancy story is non-existent; it was not designed to be a public PaaS offering. This is an issue even in production if we want to grant independence to single teams.
- Container support seems experimental at best (but is getting better in newer versions)
- Adoption of Marathon seems limited and the community is not very lively
- Discovery/scaling logic is somewhat limited

KUBERNETES

Pros:
- The design seems to be very well thought out, based on experience running Google's internal Borg clustering system (see http://research.google.com/pubs/pub43438.html for details)
- A pretty refined security model is already implemented, so that single users/teams could be given access to individual namespaces and act independently
- The community is very lively, and adoption is gaining momentum: Kubernetes is the default way to deploy apps on Google Compute Engine, it's used by Red Hat for its own cloud solution (and they contribute patches to it), and it has a clear roadmap to overcome most of its limitations
- Container support is native and technology-agnostic, allowing (for now) Docker and rkt containers to be used
- The API is quite nice
- Documentation is decently complete
- Google engineers are actively supporting us in evaluating its usage

Cons:
- The master node is not highly available, although our cluster survived a pretty serious outage in labs that froze the master and wiped out one worker
- No overcommitting is allowed, although it will be possible to mimic it with QoS (coming in the next version)
- The ability to schedule one-off jobs is offered, but there is no distributed cron facility
- In general it's a younger project with some outstanding bugs
As you can see, both technologies have pretty big pros and cons, because they are still quite "not boring", although one could argue that Mesos and Chronos at least have entered their "boring" stage. Our spreadsheet slightly favours Kubernetes at the moment, but that might change drastically if we conclude that some limitations are absolute showstoppers for us.
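To illustrate how a slight lead in a scoring sheet can be overturned by a single showstopper, here is a minimal sketch. It is not our actual spreadsheet: the criteria, weights, scores and the names option_a/option_b are all made up for the example, and the "a showstopper removes the candidate entirely" rule is just one possible way to model it.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A solution under evaluation (all values here are hypothetical)."""
    name: str
    scores: dict  # criterion name -> score, e.g. 0 (bad) to 5 (good)
    showstoppers: list = field(default_factory=list)


def weighted_total(candidate, weights):
    """Sum of weight * score over all weighted criteria (missing score = 0)."""
    return sum(w * candidate.scores.get(crit, 0) for crit, w in weights.items())


def rank(candidates, weights):
    """Drop candidates with any showstopper, then sort by weighted total."""
    viable = [c for c in candidates if not c.showstoppers]
    return sorted(viable, key=lambda c: weighted_total(c, weights), reverse=True)


# Made-up criteria and weights, purely for illustration.
weights = {"stability": 3, "multitenancy": 2, "overcommit": 1}

option_a = Candidate("option_a", {"stability": 4, "multitenancy": 1, "overcommit": 5})
option_b = Candidate("option_b", {"stability": 3, "multitenancy": 5, "overcommit": 2})

# With no showstoppers, option_b is slightly ahead on points.
ranking_before = [c.name for c in rank([option_a, option_b], weights)]

# Declaring one limitation a showstopper removes option_b regardless of score.
option_b.showstoppers.append("hypothetical blocking limitation")
ranking_after = [c.name for c in rank([option_a, option_b], weights)]
```

The point of the veto step is that a showstopper is not just a low score to be averaged away: it disqualifies the option no matter how well it does elsewhere.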
For the remainder of this week and the next few weeks, we will keep stress testing both of our test installations to uncover "surprises" and bugs.
Let us know what you think - or reach out to us if you want to help in this evaluation process. We will keep you posted!
Cheers,
Giuseppe & Yuvi