[QA] Flagging some Beta Labs issues

S Page spage at wikimedia.org
Tue Aug 19 18:19:51 UTC 2014


(I added Ori)

On Mon, Aug 18, 2014 at 4:35 PM, Maryana Pinchuk <mpinchuk at wikimedia.org>
wrote:

> Greetings, QAers,
>
> I'm not entirely sure who the point person for Beta Labs is currently,
>
Forces under Greg Grossmeier?


> but I wanted to make sure you guys are aware that there have been a lot of
> issues (i.e., partial or total outages) with this environment in the past
> 2-3 weeks, most likely due to ongoing HHVM work.
>

Perhaps, but there were also disk-full and rsync problems that ErikB and
hashar fixed. Many of the people with the ability to fix problems were out.
Beta labs has always been unstable, but (roughly since the eqiad move) it's
been pretty good -- retry 20 minutes after a failure and it's working
again. Maybe stability improved thanks to all our test automation, or maybe
we got lucky.


> Unfortunately, because several teams at WMF rely quite heavily on Beta
> Labs – such as Mobile Web for testing new user-facing features before they
> go live in production,
>
Fine, so long as you realize you're also testing everyone else's changes
that are about to go into production.


> and Design/UX for running remote and in-person user tests
>

That seems crazy. Beta labs constantly updates with the latest merged
changes to core and dozens of extensions (I think 288 times a day). Any
commit could break it, and commits regularly do! Beta labs is there to find
breaking changes before they go live, not for user testing.
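
A quick sanity check on that 288 figure, as a sketch only: if the updates
are evenly spaced (an assumption, not something confirmed in this thread),
that is one update every five minutes. The function name is made up for
illustration.

```python
# Assumption: the ~288 daily updates are evenly spaced across the day.
MINUTES_PER_DAY = 24 * 60


def update_interval_minutes(updates_per_day: int) -> float:
    """Return the average gap, in minutes, between consecutive updates."""
    if updates_per_day <= 0:
        raise ValueError("updates_per_day must be positive")
    return MINUTES_PER_DAY / updates_per_day


print(update_interval_minutes(288))  # 1440 / 288 = 5.0
```

So a user test that runs for an hour would, under that assumption, sit
through about a dozen updates.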

Maybe we should document better how to set up a reasonably performant labs
instance with a decent set of wiki pages, templates, images, etc. Then UX
can spin up ux-wikimania.wmflabs.org and be isolated from the firehose of
changes.


> To avoid situations like this in the future, is there a way for teams who
> use Beta Labs for testing to stay in closer sync with its maintainers?
>

People on #wikimedia-labs (bd808, Coren, hashar, Reedy, et al) are very
responsive if they're not away :) . For test failures I then alert
#wikimedia-qa, and if there's no response and beta labs is badly broken
(e.g. no Main_Page) I visit #wikimedia-operations. If you can figure out
which extension is causing the problem, visit its IRC channel.


> I realize y'all aren't mind-readers ;) and there will of course be
> unexpected issues that crop up from time to time. But if there are likely
> to be more major breaking changes to the infrastructure while you continue
> working on HHVM, it would be great to get an advanced heads-up so we can
> plan accordingly.
>
See Chris' proposal below for an alpha cluster for breaking changes. Until
then I dunno... status page?  Archived e-mail list?


> And when unexpected outages do occur, it'd also be good to know who to
> report them to and check in on progress with, because I'm not sure the
> current strategy of whining and hoping it'll fix itself is
> working/sane/scalable :)
>

I can't see anything better than "Beta labs isn't working when I do X,
what's up?" on IRC.


Chris McMahon replied:

> we had a policy of using beta only for software already in production.  We
> broke that policy with CirrusSearch, followed by Flow, and now by testing
> HHVM in beta labs.
>

Flow was just following orders. We're implicitly told to test new
extensions on Beta labs "for weeks" before getting the OK to deploy to
production.[1]

> I would like [more shared cluster environments] for modeling the
> production cluster (like the original beta concept) and a different one for
> working on system-wide, cross-cutting changes (like CirrusSearch, Flow,
> HHVM, etc.)


That sounds fantastic. Currently beta labs is both for "Sanity check of
merged code before it rolls out to mediawiki.org on Thursday" and for "See
how my new project fares in a production-like environment."


>  I could see having yet another "beta3" env for working with code not yet
> merged to master.
>

Isn't that what per-project labs instances are for?  How would teams
identify what goes onto beta3?  It risks becoming an odd graveyard like
test.wikipedia.org.

Cheers, and thanks for making beta labs a part of our lives.

[1]
https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1d:_new_extension

-- 
=S Page  Features engineer