[QA] # 147 more reliable Beta Labs for QA

Bryan Davis bd808 at wikimedia.org
Thu Nov 6 00:48:01 UTC 2014


On Wed, Nov 5, 2014 at 5:10 PM, Ryan Kaldari <rkaldari at wikimedia.org> wrote:
> On Wed, Oct 29, 2014 at 11:31 AM, Greg Grossmeier <greg at wikimedia.org>
> wrote:
>>
>> I fear much of the worry about Beta Cluster is due to the rocky
>> transition to HHVM (which was less than ideal). We are better
>> equipped/able to deal with such changes in the future right now (and
>> we are no longer experiencing HHVM-related issues, afaict).
>
>
> We've been told this many times, i.e. "Don't worry, the problems are all in
> the past." Yet Beta Labs keep having serious outages on a weekly basis. Just
> a few days after you sent this email, mobile Beta Labs had a nearly full-day
> outage which caused serious headaches for both the mobile and VE teams.[1] I
> would totally love to stay on the existing Beta Labs cluster, but we just
> keep having these outages week after week, despite assurances that things
> would stabilize. This is a major pain point that we need to have addressed
> in some way or another. Would spinning up an Alpha Labs cluster (for
> experimental features) be a reasonable solution?
>
> 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=72997

That outage was caused by https://gerrit.wikimedia.org/r/#/c/171055/
which was reviewed and merged by members of the mobile team. I'm not
sure how this is the fault of beta. Actually, I'd say that it is
exactly what beta is for. The real problem here was that nobody
investigated the cause of the problem by logging in to beta and
looking at the logs. I'm pretty sure that this particular error could
have been reproduced in any development environment by checking out
the current git HEAD. I'm not trying to pick a fight here; I really am
trying to get a handle on the general expectations of the teams that
are making heavy use of beta in their daily workflow.

If we had a two-stage integration environment (alpha & beta in the
current local vernacular), this error would have appeared in the alpha
environment first. Depending on the test coverage, it may or may not
have been caught before the gating process that advances the code to
the slightly more stable beta environment. There are some advantages to
this sort of system, but they come at a cost. Typically that cost is a
slower pace of production change. Depending on where you come down on
the "ship it and see what happens" spectrum, you may see that as a good
or a bad thing.

Bryan
-- 
Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855


