Basics: We rolled out 1.21wmf5 to the non-Wikipedia sites today, after a
brief reversion and re-deployment to fix breakage in how we were
displaying some styling. We are on track to deploy 1.21wmf5 to English
Wikipedia on Monday, December 3 per
Below: why this happened and how it got fixed, and what we should change
to prevent problems like this in the future.
A recent change modified the headings in the
Vector skin. The new code didn't take the WMF config into account, as
the author wasn't expecting styles and HTML to be cached in such
The headings were changed from "h4"/"h5", but the CSS used those tags to
identify them (instead of using CSS classes). This means that cached
pages still contain the old tags, which the new CSS no longer matches,
so the page layout can stay broken for up to 30 days.
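
To make the failure mode concrete, here is a minimal Python sketch. It
is a toy model, not the actual Vector CSS: a stylesheet is reduced to
the set of tag names it styles. The "h3" replacement tag and the names
is_styled, new_only_css and compatible_css are assumptions made up for
illustration; the report only says the headings moved away from
"h4"/"h5".

```python
from typing import Set

OLD_TAGS = {"h4", "h5"}  # heading tags named in the report
NEW_TAG = "h3"           # assumption: stand-in for the replacement tag


def is_styled(heading_tag: str, selectors: Set[str]) -> bool:
    """Return True if the stylesheet has a rule matching the heading tag."""
    return heading_tag in selectors


# Stylesheet that only knows about the new markup.
new_only_css = {NEW_TAG}
# Backwards-compatible stylesheet: keeps the old selectors as well.
compatible_css = {NEW_TAG} | OLD_TAGS

cached_page_tag = "h5"    # HTML rendered before the deploy, still cached
fresh_page_tag = NEW_TAG  # HTML rendered after the deploy

print(is_styled(cached_page_tag, new_only_css))    # False: cached page breaks
print(is_styled(fresh_page_tag, new_only_css))     # True
print(is_styled(cached_page_tag, compatible_css))  # True: both generations work
print(is_styled(fresh_page_tag, compatible_css))   # True
```

Styling by CSS class instead of by tag name would sidestep the problem
for future tag changes, as long as the class is already present in the
cached HTML.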
Page cache is controlled by the wiki page content. Unless the page is
modified, the cache is kept for up to 30 days for anonymous users.
Resource modules, however, are served by ResourceLoader, which has its
own much more efficient and deployable cache mechanism. But this means
that the resources for the skin are deployed site-wide within 5
minutes, whereas the cached HTML isn't refreshed for another 2 weeks.
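
The timing mismatch can be sketched with a little date arithmetic. This
is only an illustration: the 5-minute and 30-day figures come from the
description above, while the function name mismatch_window and the
example timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

RESOURCE_PROPAGATION = timedelta(minutes=5)  # skin resources: within 5 minutes
HTML_CACHE_MAX_AGE = timedelta(days=30)      # anonymous page cache: up to 30 days


def mismatch_window(
    page_cached_at: datetime, deployed_at: datetime
) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) of the period in which an unmodified cached page
    can be served with the new resources, or None if there is no such period."""
    if page_cached_at >= deployed_at:
        return None  # page was rendered by the new code, so HTML and CSS agree
    start = deployed_at + RESOURCE_PROPAGATION  # new resources are live by now
    end = page_cached_at + HTML_CACHE_MAX_AGE   # cached HTML expires at the latest
    if end <= start:
        return None  # cached copy expires before the new resources arrive
    return start, end


# Hypothetical timestamps, chosen only for the example.
cached_at = datetime(2012, 11, 27, 12, 0)
deployed_at = datetime(2012, 11, 28, 20, 0)

window = mismatch_window(cached_at, deployed_at)
if window is not None:
    start, end = window
    print(end - start)  # 28 days, 15:55:00
```

Editing or purging the page ends the window early, since that replaces
the cached HTML with a fresh render.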
The issues this caused were visible in beta labs for the last three
days, but none of us realized they were significant because we thought
they were caused by a misconfigured memcache; see
We knew that this particular change and the related change
might be problematic and sent
out a note about it on Monday --
-- but it looks like we didn't test thoroughly enough on Monday and
Tuesday to catch it before the Wednesday deploy. Only anonymous users
would have been affected: we don't cache pages for logged-in users in
Squid, so logged-in users didn't notice problems on mediawiki.org
after the first deploy.
Problems popped up after the Phase 2 deployment to non-Wikipedia sites,
so we reverted the 1.21wmf5 deployment and then redeployed while fixing,
Gerrit changes: https://gerrit.wikimedia.org/r/#/c/35819
What we should fix for the future:
This is why client resources must always be backwards compatible.
"Don't change the HTML in incompatible ways" is probably a good
general rule to live by--but having an easy way to say "start purging
all pages on $theseWikis from Squid/Varnish" would also be nice.
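
As a rough idea of what such a purge helper could look like, here is a
sketch that only builds request parameters for the MediaWiki API's
action=purge and sends nothing. The batch size of 50, the function name
purge_requests and the page titles are assumptions for illustration,
and whether a purge also evicts the page from Squid/Varnish depends on
how the wiki is configured.

```python
from typing import Dict, Iterator, List

API_BATCH_SIZE = 50  # assumption: titles per request for an ordinary client


def purge_requests(
    titles: List[str], batch_size: int = API_BATCH_SIZE
) -> Iterator[Dict[str, str]]:
    """Yield one POST parameter dict per batch of titles for action=purge."""
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        yield {
            "action": "purge",
            "titles": "|".join(batch),  # the API separates titles with "|"
            "format": "json",
        }


# Hypothetical titles; a real run would first list the pages on each wiki.
example_titles = ["Page %d" % n for n in range(120)]
requests_to_send = list(purge_requests(example_titles))
print(len(requests_to_send))  # 3
```

Each dict would then be POSTed to the api.php endpoint of every wiki in
the list, ideally throttled so the backends are not flooded with
re-renders.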
* get more manual testing on test2.wikipedia.org
  immediately after Phase I deployment, including as anonymous reader and
  editor to ensure we catch Squid caching issues
* train more people to review code well, to reduce backlog and catch
  these kinds of problems?
* get more people to +2 in core and in important extensions
* beta labs needs to be trustworthy enough to make this sort of thing
  a blocker immediately
Chris McMahon's take: (for what it's worth, this seems to me to be
a sign that beta labs is becoming more and more trustworthy all the
time. The more we actually use it, the more we'll understand what does
and does not work there. We fixed the memcache problem, which fixed the
ability to login, but didn't investigate the display problems because
we're used to beta not being very reliable. In this case, beta was
reliable, and we didn't understand that. Even with a bug report in
bugzilla with 9 subscribers, no one recognized a real issue.)
Chris McMahon said: I think this could be framed as an issue of signal,
noise, and bandwidth. Beta labs being broken a lot, review backlog in
gerrit, false failures in tests are all noise. Given the constraints of
ongoing projects, it is difficult to pick out the signal from the noise.
We can take steps to reduce the noise so that the signal stands out
more by reducing technical debt: make the tests green, make the test
environment robust, keep up with code review.
(I assembled this just now from IRC & mailing list chatter from several
people, and any errors are mine -- sorry for missing attributions here.
Drafting was on http://etherpad.wmflabs.org/pad/p/nov-28-2012-deploysnafu )
Engineering Community Manager