[QA] # 147 more reliable Beta Labs for QA

Greg Grossmeier greg at wikimedia.org
Wed Oct 29 18:31:38 UTC 2014


Antoine: yes :)

All: please email the QA mailing list with requests like this so we
don't keep split-braining the conversation. I've done so now. (Also,
please subscribe!) (I've BCC'd Andrew O, assuming he's just the
messenger here.)

Short version: Beta stability is always an ongoing goal of RelEng.
With that said: can you give specific examples of Beta Cluster being
unstable *today*? Asking for "more reliable" is fine, but we need
targets, goals, and filed bugs to work against.

I fear much of the worry about Beta Cluster is due to the rocky
transition to HHVM (which was less than ideal). We are now better
equipped to deal with such changes in the future (and we are no
longer experiencing HHVM-related issues, afaict).

We are also not planning to create yet another cluster at this time.
Instead, we plan to use multiversion (hetdeploy) to deploy two versions
on the current cluster (master updating every 10 mins as now, and a
nightly updating once per day). That isn't on the immediate roadmap
(iow: more than a month out) and will wait until after the hardware
procurement for WMF Labs infrastructure. (NB: There is also the work to
convert the load of if-statements that currently make the prod puppet
code work on Beta Cluster to use Hiera instead. That is slightly
orthogonal, though it will help improve stability as well.)

I've also updated the referenced wiki page accordingly:
https://www.mediawiki.org/w/index.php?title=Wikimedia_Release_Engineering_Team%2FStaging_Cluster&diff=1245809&oldid=1202735

(Sorry, that page should have had {{draft}} on it from the beginning.)

Relatedly, Jenkins/Zuul are suffering from some performance issues,
which is one of the reasons Antoine is taking a sabbatical from IRC (to
address those issues; see that thread for more details). When those
issues happen (eg: failed browser tests due to timeouts, which
shouldn't be happening very often at all anymore), a common response is
to incorrectly blame Beta Cluster. Instead, let's all file bugs when
those issues occur so that A) we have a record of them and B) we can
identify the root cause instead of assuming which piece is failing.

Thanks,

Greg


On Wed, Oct 29, 2014 at 11:14 AM, Antoine Musso <amusso at wikimedia.org> wrote:
> Hello Greg,
>
> You are probably in a better seat to reply to that SOS card related to beta
> stability.
>
> Should we bring it up on QA list to reach a wider audience?
>
> ---------- Forwarded message ----------
> From: "Andrew Otto" <aotto at wikimedia.org>
> Date: 29 Oct 2014 18:51
> Subject: # 147 more reliable Beta Labs for QA
> To: "Marc A. Pelletier" <marc at uberbox.org>, "Antoine Musso"
> <amusso at wikimedia.org>, <rkaldari at wikimedia.org>
> Cc:
>
>> Hi yalls,
>>
>> Ryan brought up this card at Scrum of Scrums today, and I have an action
>> item to ask you about it.
>>
>>
>> https://wikimedia.mingle.thoughtworks.com/projects/scrum_of_scrums/cards/147
>>
>>
>> https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Staging_Cluster
>>
>> Thoughts?
>>
>> Thanks!
>> -Ao
>>
>>
>



-- 
Greg Grossmeier
Release Team Manager


