Hi everyone,
The email below was written with an internal audience in mind, but
Krenair pointed out that there would be a lot of general interest in
this.
Rob
---------- Forwarded message ----------
From: Rob Lanphier <robla(a)wikimedia.org>
Date: Fri, Jul 18, 2014 at 5:20 PM
Subject: HHVM deployment update
To: WMF Engineering List, Operations Engineers
Hi everyone,
I'm writing to give you a quick update about where we are with HHVM
deployment, so you know what to expect. Ori, Aaron, Tim, Giuseppe,
Brett, Antoine and probably others I'm forgetting have been hard at
work getting HHVM ready, and are about to make changes that may affect
your work in production.
The good news is that the way it'll most affect you is that your code
will run a lot faster. The bad news is that there is some risk of
breakage due to the number and nature of things we're changing.
We've started referring to the stack that we're migrating to as "HAT"
(HHVM, Apache 2.4, Trusty), as a nod to LAMP and as a useful tag on
Gerrit changes. The point being: this is not merely a change in our
PHP implementation, but a full stack upgrade that may have
implications beyond just the problems that might be introduced by
HHVM.
The team is deploying this one piece at a time, with the first bit of
the deployment happening very soon. Here's our rough timeline:
* Now: limited deployments of production job runners to osmium, which
the team only leaves on when they are monitoring it for errors
* Week of July 21: Deployment to Beta Cluster. The timing on this may
slip, since it might be a surprise to a few people who are deeply
affected by it (/me waves to Chris McMahon), but we think it's
generally ready from an engineering perspective.
* Week of July 21: Deployment to a few job runners in production.
You'll know the first job runner was deployed when you see this
patch[1] get its +2. We didn't get a chance to coordinate this with
Greg today, so exact timing is TBD.
* Sometime later: Deployment to test.wikipedia.org application server
* Sometime later: Deploy Varnish module allowing partial deployment to
a fraction of application servers
* Sometime later: Limited deployment to a small number of application servers
* Sometime later: Ramp up deployment to more application servers until
most servers use HHVM
* Sometime later: Deploy to remainder of services
How to test your extension with HHVM:
-------------------------------------
Historically, we've treated HHVM-related bugs in MediaWiki extensions
as the sole responsibility of the HHVM team, because we could not
reasonably expect developers to test their code on HHVM while it was
still difficult to build and configure. As we head toward full
deployment, however, we are going to progressively shift
responsibility onto you to be proactive about testing your code with
HHVM and reporting any issues you encounter.
If you're not sure how to test your code with HHVM, ask! The options
that are currently available to you are:
* On your machine, using MediaWiki-Vagrant (HHVM is the default PHP runtime)
* On Labs, using Labs-Vagrant
(<https://wikitech.wikimedia.org/wiki/Labs-vagrant>). The Flow team is
doing this; ask them how. :)
* Sometime next week: on the Beta cluster, when we switch it over to HHVM.
Use the "hiphop" keyword in Bugzilla to catch our attention.
Some things that will change, and the associated challenges:
* Lots of C++ code that is generally high-quality but doesn't have
quite as many flight-hours logged in production as PHP, with all
that entails.
* We expect the performance profile to improve substantially, but we
can't rule out the possibility that specific operations will suffer
performance regressions.
* Distribution-upgrade risks: there are many utilities we rely on
besides MediaWiki itself, and many of those utilities will see
upgrades as well. For example, a lot of the utilities on our image
scalers (e.g. imagemagick, avconv, etc.) will be upgraded.
What we're doing to mitigate/minimize risk: test, test, test. A lot
of the work that's been going on has been to improve the state of our
unit tests such that we can have a clean test run before deploying all
the way, a task made trickier by the fact that our current codebase
doesn't meet that bar[2].
That's all for now. More information about HHVM and our deployment to
it can be found at https://www.mediawiki.org/wiki/HHVM . Anything
that isn't there, come talk to me, and I'll turn around and ask Ori.
:-)
Thanks!
Rob
[1] Puppet repo patch "jobrunner: create hhvm-only jobrunners"
https://gerrit.wikimedia.org/r/#/c/147086/
[2] Tracking bugs for unit tests that fail in HHVM (and
unfortunately, our current production setup too):
https://bugzilla.wikimedia.org/show_bug.cgi?id=67216