Hi everybody,
I cannot believe I have to say something about this, but I guess it's no surprise.
Wikipedia has a well-known policy against edit warring, where users are encouraged to discuss changes and achieve consensus before blindly reverting. This applies even more so to Gerrit, since changes to software have a much bigger effect.
Here's a nice example:
https://gerrit.wikimedia.org/r/114400
https://gerrit.wikimedia.org/r/117234
https://gerrit.wikimedia.org/r/117247
Some key points to note here:
* The revert commit was not linked on the original commit
* The time between the revert patch being uploaded and +2'ed was a mere two minutes
* All the reviewers on the revert patch were also reviewers on the original patch
This is unacceptable behavior, and is extremely disrespectful to the developers here. If you are going to revert a patch for reasons other than a blatant code review issue (such as a fatal error or the like), you should *at the very least* give the original patch reviewers time to understand why the patch is being reverted and give their input on the matter. Otherwise it defeats the entire point of the code review process and of Gerrit in the first place.
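A revert generated with plain git already carries the back-reference Gerrit needs, because git writes "This reverts commit <sha>." into the generated message. A minimal sketch, where the SHA is a placeholder:

    import subprocess

    BAD_SHA = "deadbeef"  # placeholder: SHA of the change being reverted

    # `git revert` creates a commit whose message is
    #   Revert "<original subject>"
    #   This reverts commit <sha>.
    # which is exactly the cross-link that was missing here.
    subprocess.run(["git", "revert", "--no-edit", BAD_SHA], check=True)

    # Print the generated message so a reviewer can confirm the back-reference.
    log = subprocess.run(["git", "log", "-1", "--format=%B"],
                         check=True, capture_output=True, text=True)
    print(log.stdout)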
The argument being made in this specific case is that the change broke the workflow of mobile, and that the revert was announced on mobile-l. This is not sufficient for a number of reasons:
1) Not everybody is subscribed to mobile-l, so you cannot expect the original reviewers to see or know about it.
2) This is an issue with MobileFrontend, not MediaWiki core.
3) Code being merged does not automatically cause a deployment, and if code being deployed breaks something in production, it is the operations team's job to undeploy that change.
Overall, the lesson to take away here is to be more communicative with other developers, especially when you are negating their changes or decisions.
Thanks in advance,
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Thu, Mar 6, 2014 at 8:29 PM, Tyler Romeo tylerromeo@gmail.com wrote:
> 1) Not everybody is subscribed to mobile-l, so you cannot expect the original reviewers to see or know about it. [...]
If your patch causes a serious UX regression like this, it's going to get reverted. The core patch involved was being deployed to Wikimedia sites and impacting MobileFrontend users today. If we had more time in the deployment cycle to wait and the revert was a simple disagreement, then waiting would be appropriate. It is obvious in this case that no one tested the core change on mobile. That's unacceptable.
And yes, Jon should have made sure the revert and the original patch were cross-referenced. I'm sure he'll do that next time he commits a revert.
Steven
On Thu, Mar 6, 2014 at 4:15 PM, Steven Walling <steven.walling@gmail.com> wrote:
> If your patch causes a serious UX regression like this, it's going to get reverted. [...] It is obvious in this case that no one tested the core change on mobile. That's unacceptable.
You quoted my email, but didn't seem to read it. Changes to MediaWiki core should not have to take into account extensions that incorrectly rely on its interface, and a breakage in a deployed extension should result in an undeployment and a fix to that extension, not a revert of the core patch.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Thu, Mar 6, 2014 at 2:08 PM, Tyler Romeo tylerromeo@gmail.com wrote:
> On Thu, Mar 6, 2014 at 4:15 PM, Steven Walling <steven.walling@gmail.com> wrote:
>> If your patch causes a serious UX regression like this, it's going to get reverted. [...]
> You quoted my email, but didn't seem to read it. Changes to MediaWiki core should not have to take into account extensions that incorrectly rely on its interface [...]
I don't think core is in any way special here. It doesn't matter what broke what; the whole is much more important than the individual parts. If the patch to core is what broke things, reverting it is the appropriate course of action.
Communication in the wikiverse is hard.
To clarify, this is _not_ an issue with MobileFrontend. The same problem affects users without JavaScript. There was a fundamental problem with this patch that sadly didn't get caught during code review. It broke the workflow of mobile on an important page in production, which is a bad thing. On a side note, it saddens me that mobile gets very little attention during code review on essential parts of our infrastructure. If anyone has any ideas on how this can be remedied, please let me know.
Moan about the deployment train: the code was merged at the final hour before a deployment train (this is another issue: our deployment train doesn't distinguish between patches that have been sitting on master for a week and patches that have been sitting there for an hour). Had this been merged on a Thursday morning we would have had more breathing room, and a revert could maybe have been avoided (though I still don't think that patch was in a mergeable state).
In answer to a few statements you made...
"Wikipedia has a notorious policy against edit warring, where users are encouraged to discuss changes and achieve consensus..." Agreed but that consensus should also be achieved during review. It seems during the code review process [1] there was an open concerns that had been raised and a -1 from Steven that was unaddressed. In this case we have the luxury to discuss this more and explore problems and in my opinion it was not worthy of a rushed merge. Yes we can't please everyone but it would have been good to get more people involved in this conversation.
"not everybody is subscribed to mobile-l, so you cannot expect the original reviewers to see or know about it" Yes, and posts to wikimedia-l go straight to my archive, so I usually miss them so I wasn't aware of this mail until someone pointed me to it. Communicating so everyone gets a message is hard :-).
That said, I did screw up here in that I didn't comment on the patchset with a link to the mobile-l mailing list. In fact I started to, and then got distracted by a conversation and forgot to hit save. I will be more careful in future. All conversations about code should start in code and I'm sorry I didn't adhere to this rule this time.
[1] https://gerrit.wikimedia.org/r/#/c/114400/
On Thu, Mar 6, 2014 at 2:08 PM, Tyler Romeo <tylerromeo@gmail.com> wrote:
> Changes to MediaWiki core should not have to take into account extensions that incorrectly rely on its interface, and a breakage in a deployed extension should result in an undeployment and a fix to that extension, not a revert of the core patch.
Changes to MediaWiki core should avoid breaking Wikipedia in production, especially since we aggressively push new versions of core and extensions to Wikipedia every few weeks.
For years and years and years we've been very free about reverting things that break. No one, including old-timers like me and Tim, has the "right" to not have something reverted. If it needs to be reverted it will be reverted -- there is nothing personal in a revert. Remember it can always be put back once all problems are resolved.
Is there anything specific in the communications involved that you found was problematic, other than a failure to include a backlink in the initial revert?
-- brion
To take a couple of steps back...
This happened because testing isn't robust enough?
That should be discussed and followed up on.
On Thu, Mar 6, 2014 at 3:34 PM, Brion Vibber <bvibber@wikimedia.org> wrote:
> For years and years and years we've been very free about reverting things that break. No one, including old-timers like me and Tim, has the "right" to not have something reverted. [...]
> Is there anything specific in the communications involved that you found was problematic, other than a failure to include a backlink in the initial revert?
<quote name="George Herbert" date="2014-03-06" time="15:53:50 -0800">
To take a couple of steps back...
This happened because testing isn't robust enough?
That should be discussed and followed up on.
Ish.
The suite of automatic tests caught this bug, actually. It's how the mobile team found out about it as they got to work this morning. So the testing is quite robust.
I'd argue that the revert didn't happen fast enough.
"Whoa! Greg! Don't be incendiary!"
What I mean by that is:
I want us (where 'us' == any developer writing MediaWiki or extension code) to get to the point where we reject and revert any commit which breaks the test suite. Basically, when a test fails after a commit you (as the developer) should:
A) fix it right away (like, now), or
B) revert your commit that broke it and work on a fix
B enables other people to continue working with a good state of the software.
Doing neither, which is what happened here, makes *your work stop other people's productivity*. Period.
This should happen no matter where the test fails, whether it's "your code" or not. Your code caused it (in the sense that the test didn't fail before), so you should work on fixing it.
There's a bit more nuance here:
A test can fail for any number of reasons including badly written tests or the test infrastructure failing somehow. Part of the above decision tree should include determining what actually broke. If the test was badly written, rewrite it or remove it if it no longer applies.
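One practical aid for the "determine what actually broke" step: when the offending commit isn't obvious, git can find it mechanically. A minimal sketch, where the known-good revision and the test command are placeholders:

    import subprocess

    def sh(*cmd):
        """Run a git command, raising on failure."""
        subprocess.run(cmd, check=True)

    GOOD = "1.23wmf16"  # placeholder: last revision known to pass
    BAD = "master"      # placeholder: a revision known to fail
    TEST = "php tests/phpunit/phpunit.php --group=Essential"  # illustrative test command

    # `git bisect run` checks out revisions between GOOD and BAD and runs the
    # test command at each step; exit code 0 marks a revision good, nonzero
    # marks it bad, until the first failing commit is isolated and printed.
    sh("git", "bisect", "start", BAD, GOOD)
    sh("git", "bisect", "run", "sh", "-c", TEST)
    sh("git", "bisect", "reset")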
We aren't to this point yet; we need a bit more test coverage and we need to speed up the feedback cycle for auto browser tests, but we're headed in this direction. On purpose.
We need to get in the habit of making every commit deployable. No more breaking beta cluster for a day while we work something out. No more breaking other parts of the code base and ignoring it because 'it's not core'.
Greg
So I'm confused on the timeline here.
Did the commit get merged before the testsuite found the breakage, or did the commit get merged despite the testsuite failing?
Either way, it sounds like "when to merge a changeset" needs to be reviewed.
On Thu, Mar 6, 2014 at 6:38 PM, Greg Grossmeier greg@wikimedia.org wrote:
> The suite of automatic tests caught this bug, actually. It's how the mobile team found out about it as they got to work this morning. So the testing is quite robust.
> I'd argue that the revert didn't happen fast enough. [...]
On Thu, Mar 6, 2014 at 5:49 PM, OQ overlordq@gmail.com wrote:
> So I'm confused on the timeline here.
> Did the commit get merged before the testsuite found the breakage, or did the commit get merged despite the testsuite failing?
The commit was merged late Wednesday. The automated tests that demonstrated the problem failed over Wednesday night and we analyzed the failures early Thursday morning, which is routine.
As noted above, code committed late on Wednesday or early Thursday only resides in the test environment on beta labs for a short time before going to production wikis. We intend to improve this situation in the not too distant future, but for now that is the situation on the ground. -Chris
So the testsuite only runs on merged code and not pending-merge? That sounds like a large oversight.
On Thu, Mar 6, 2014 at 6:55 PM, Chris McMahon <cmcmahon@wikimedia.org> wrote:
> The commit was merged late Wednesday. The automated tests that demonstrated the problem failed over Wednesday night and we analyzed the failures early Thursday morning, which is routine. [...]
On Thu, Mar 6, 2014 at 6:07 PM, OQ overlordq@gmail.com wrote:
> So the testsuite only runs on merged code and not pending-merge? That sounds like a large oversight.
Picture in your mind every branch pending merge for every extension in gerrit. Imagine how many of those branches are eventually abandoned, imagine how many patch sets each receives, imagine how many times each gets rebased.
And even if we had such tests, they would not have exposed today's issue.
We run UI-level regression tests against a model of the Wikipedia cluster on beta labs running the master branch *exactly* so that we can expose cross-repo problems, configuration problems, etc. before they go to production.
Today's issue was hardly unique. Just one week ago our tests picked up an entirely unrelated but similarly surprising issue that had the MobileFrontend team scrambling on a Thursday morning: https://bugzilla.wikimedia.org/show_bug.cgi?id=62004. We stop bugs *all the time* this way.
This is hardly an "oversight". These tests and these test environments are very carefully designed to expose exactly the kind of issues that they expose. They have saved us an extraordinary amount of pain by preventing bugs from being released to production.
I wonder if in future it might be practical/useful for test failures like this to automatically revert the changes that made them, or at least submit patches to revert them... that way it's clear how and when things should be reverted.
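Jon's idea is close to something Gerrit's REST API can already do: merged changes expose a revert endpoint that creates a new revert change, with the cross-reference in its commit message. A minimal sketch, assuming HTTP API credentials and a change number (all placeholders), and noting that the endpoint's availability depends on the Gerrit version:

    import json
    import requests

    GERRIT = "https://gerrit.wikimedia.org/r"
    AUTH = ("ci-bot", "http-password")  # hypothetical HTTP credentials

    def revert_change(change_number, reason):
        """Ask Gerrit to create a revert change for a merged change."""
        r = requests.post(
            f"{GERRIT}/a/changes/{change_number}/revert",
            auth=AUTH,
            json={"message": reason},
        )
        r.raise_for_status()
        # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI.
        return json.loads(r.text[len(")]}'"):])

    # Hypothetical usage from a CI job that just saw browser tests fail:
    # revert_change(114400, "Broke account creation browser tests on beta labs.")

The created change would still need review and +2, which keeps a human in the loop while making the "how and when" of reverts mechanical.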
On 6 Mar 2014 18:09, "Chris McMahon" <cmcmahon@wikimedia.org> wrote:
> We run UI-level regression tests against a model of the Wikipedia cluster on beta labs running the master branch *exactly* so that we can expose cross-repo problems, configuration problems, etc. before they go to production. [...]
On Thu, Mar 6, 2014 at 9:21 PM, Jon Robson jdlrobson@gmail.com wrote:
> I wonder if in future it might be practical/useful for test failures like this to automatically revert the changes that made them, or at least submit patches to revert them... that way it's clear how and when things should be reverted.
Or, at the very least, changes are not deployed until these tests pass. I agree that running expensive browser tests on every Gerrit change is unnecessary and performance-intensive, but it is reasonable to run them before new deploys.
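As a sketch of such a deploy gate (the Jenkins base URL and job names are assumptions), Jenkins exposes each job's last completed build over its JSON API, so a deploy script can refuse to proceed unless every gate job is green:

    import sys
    import requests

    JENKINS = "https://integration.wikimedia.org/ci"  # assumed CI base URL
    # hypothetical names for the expensive browser-test jobs:
    GATE_JOBS = ["browsertests-MobileFrontend", "browsertests-core"]

    def last_result(job):
        """Return the result (SUCCESS, FAILURE, ...) of a job's last completed build."""
        r = requests.get(f"{JENKINS}/job/{job}/lastCompletedBuild/api/json")
        r.raise_for_status()
        return r.json()["result"]

    failing = [job for job in GATE_JOBS if last_result(job) != "SUCCESS"]
    if failing:
        print("Refusing to deploy; failing gate jobs:", ", ".join(failing))
        sys.exit(1)
    print("All gate jobs green; safe to start the deploy.")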
Rob said:
> I wholeheartedly disagree with this. Changes to core should definitely take into account uses by widely-deployed extensions (where "widely-deployed" can either mean by installation count or by end-user count), even if the usage is "incorrect". We need to handle these things on a case-by-case basis, but in general, *all* of the following are options when a core change introduces an unintentional extension incompatibility:
> - Fix the extension quickly
> - Revert the change
> - Undeploy the extension until it's fixed to be compatible with core
I don't think you see the problem here. Consider this case as an example (I agree that this is case-by-case, so let's limit the scope to this one). You're forgetting that the original patch fixes a bug. In fact, it fixes a pretty serious UX bug in my opinion (and many others who supported merging this patch).
So to summarize, #3 is obviously not an option. For #2, are we supposed to block core development, and let this bug persist indefinitely, because of a much less serious bug in an extension? That really only leaves #1, but apparently the minority who opposed the original patch decided it was a good idea to skip straight to #2.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
Tyler wrote:
> - Fix the extension quickly
> - Revert the change
> - Undeploy the extension until it's fixed to be compatible with core
>
> So to summarize, #3 is obviously not an option. For #2, are we supposed to block core development, and let this bug persist indefinitely, because of a much less serious bug in an extension? That really only leaves #1 [...]
I think you're setting up a false premise.
Based on the timing description here, it seems more like "Either rush 1 or rush 2".
One of these is going back to a known state, albeit with a known bug. The other is going forwards to an unknown state.
As the grumpy cat cartoon going around recently points out, our code had 99 bugs in it, we patched one, and now we have 127 bugs in it. Rushing new, untested patches forwards under deadlines starts to approach the "real world people get fired for this" behavior from earlier comments.
Rolling back lets you work out what the right thing to do is for next week, without (much) time pressure. If the pre-existing bug was tolerable enough to have been there for a while without requiring an emergency prod patch, it can stay another week without disaster striking.
On Thu, Mar 6, 2014 at 10:13 PM, George Herbert <george.herbert@gmail.com> wrote:
> Based on the timing description here, it seems more like "Either rush 1 or rush 2".
This is also not true. Something does not have to be reverted in Gerrit in order for it to be undeployed from production. If there were any timing issue to consider here, I would say we'd have to reach a solution after a few days.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On 03/06/2014 10:44 PM, Tyler Romeo wrote:
> On Thu, Mar 6, 2014 at 10:13 PM, George Herbert <george.herbert@gmail.com> wrote:
>> Based on the timing description here, it seems more like "Either rush 1 or rush 2".
> This is also not true. Something does not have to be reverted in Gerrit in order for it to be undeployed from production.
Yes, it does. Unless the entire branch has a serious problem (500s or major caching problems, etc.), we don't generally switch the entire branch back.
That means the only option is to fix or revert a commit. The general rule is to make changes in master before cherry-picking to the branch.
Matt Flaschen
On Fri, Mar 7, 2014 at 4:49 PM, Matthew Flaschen <mflaschen@wikimedia.org> wrote:
> Yes, it does. Unless the entire branch has a serious problem (500s or major caching problems, etc.), we don't generally switch the entire branch back.
> That means the only option is to fix or revert a commit. The general rule is to make changes in master before cherry-picking to the branch.
What you're saying is that the software development process for MediaWiki is so tightly coupled with the operations deployment process, that development has to be held up because of problems in operations. That's a problem.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
<quote name="Tyler Romeo" date="2014-03-07" time="16:54:27 -0500">
What you're saying is that the software development process for MediaWiki is so tightly coupled with the operations deployment process, that development has to be held up because of problems in operations. That's a problem.
Or a benefit, giving third-party users confidence that the core they use has a quick feedback loop with real users and is thoroughly tested.
It's all about perspective.
From these conversations, your perspective seems to be (and please correct me if I'm wrong) that what WMF does with deployed code should have no bearing on MediaWiki core/master, and that all of our massive testing infrastructure equally shouldn't touch core/master.
Our "massive testing infrastructure" includes the Beta Cluster, Jenkins, SauceLabs, and the group0 wikis (test.wikipedia, mw.org etc). And, let's be realistic, the Wikimedia community as a whole. Code is never really tested until real users interact with it.
I probably mischaracterized your perspective, but I kinda wanted to make a point that this is all done with quality in mind, not the other way around.
Do you have a recommendation on how we would 'decouple' this while also keeping the same short feedback loop and testing rigor that we do have (and intend to increase, both in rigor and in speed of feedback)?
Thanks,
Greg
On Fri, Mar 7, 2014 at 5:29 PM, Greg Grossmeier greg@wikimedia.org wrote:
> Or a benefit, giving third-party users confidence that the core they use has a quick feedback loop with real users and is thoroughly tested.
> It's all about perspective.
> From these conversations, your perspective seems to be (and please correct me if I'm wrong) that what WMF does with deployed code should have no bearing on MediaWiki core/master, and that all of our massive testing infrastructure equally shouldn't touch core/master.
There is a difference between deploying code on a test cluster and deploying it to all production servers.
Our "massive testing infrastructure" includes the Beta Cluster, Jenkins, SauceLabs, and the group0 wikis (test.wikipedia, mw.org etc). And, let's be realistic, the Wikimedia community as a whole. Code is never really tested until real users interact with it.
Sure, but immediately deploying untested changes to all users is a reckless method of having real users test something.
> I probably mischaracterized your perspective, but I kinda wanted to make a point that this is all done with quality in mind, not the other way around.
> Do you have a recommendation on how we would 'decouple' this while also keeping the same short feedback loop and testing rigor that we do have (and intend to increase, both in rigor and in speed of feedback)?
I think some of the things mentioned here are good solutions. The biggest problem here is that this patch went out almost completely untested. It should have been caught long before it was put into production.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
<quote name="Tyler Romeo" date="2014-03-07" time="17:34:11 -0500">
Sure, but immediately deploying untested changes to all users is a reckless method of having real users test something.
[snip]
I think some of the things mentioned here are good solutions. The biggest problem here is that this patch went out almost completely untested. It should have been caught long before it was put into production.
Agreed, we need to address this. See the pipeline discussion.
Greg
On Fri, Mar 7, 2014 at 1:54 PM, Tyler Romeo <tylerromeo@gmail.com> wrote:
> What you're saying is that the software development process for MediaWiki is so tightly coupled with the operations deployment process, that development has to be held up because of problems in operations. That's a problem.
With all due respect: hell, yes, development comes in second to operational stability.
This is not disrespecting development, which is extremely important by any measure. But we're running a top-10 worldwide website, a key worldwide information resource for humanity as a whole. We cannot cripple development to try and maximize stability, but stability has to be priority 1. Any large website's teams will have the same attitude.
I've had operational outages reach the top of everyone's news source/feed/newspaper/broadcast. This is an exceptionally unpleasant experience.
It's true that an enlightened balanced approach understands that existing instability that is baked in to prod must be developed out, in many cases, and that too many roadblocks in development will therefore be counterproductive. But too fast is lethal; too slow builds up technical debt, but at a comprehensible rate. If you aren't sure what just happened, the technical debt can be retired at everyone's patience and leisure, and with some feature slip.
On 7 March 2014 22:39, George Herbert george.herbert@gmail.com wrote:
> With all due respect: hell, yes, development comes in second to operational stability. This is not disrespecting development, which is extremely important by any measure. But we're running a top-10 worldwide website, a key worldwide information resource for humanity as a whole. We cannot cripple development to try and maximize stability, but stability has to be priority 1. Any large website's teams will have the same attitude.
> I've had operational outages reach the top of everyone's news source/feed/newspaper/broadcast. This is an exceptionally unpleasant experience.
Speaking as a sysadmin with "it's gotta be up!" responsibility, I wholeheartedly concur.
But then this is an argument for "branch live 24 hours before deploy time", not "halt development". People can generally cope with two branches for 24 hours.
- d.
On Fri, Mar 7, 2014 at 5:39 PM, George Herbert <george.herbert@gmail.com> wrote:
> With all due respect: hell, yes, development comes in second to operational stability. [...] We cannot cripple development to try and maximize stability, but stability has to be priority 1. Any large website's teams will have the same attitude.
> I've had operational outages reach the top of everyone's news source/feed/newspaper/broadcast. This is an exceptionally unpleasant experience.
If you really think stability is top priority, then you cannot possibly think that the current deployment process is sane.
Right now you are placing the responsibility on the developers to make sure the site is stable, because any change they merge might break production since it is automatically sent out. If anything, that gives the appearance that the operations team doesn't care about stability, and would rather wait until things break and revert them.
It is the responsibility of the operations team to ensure stability. Having to revert something because that's the only way production will be stable is not a proper workflow.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo tylerromeo@gmail.com wrote:
> Right now you are placing the responsibility on the developers to make sure the site is stable, because any change they merge might break production since it is automatically sent out. [...]
> It is the responsibility of the operations team to ensure stability. Having to revert something because that's the only way production will be stable is not a proper workflow.
That's not how it works. Developers are responsible for helping make sure the site is stable. It's the developer's responsibility to write good code that's not going to crash the site. This has always been true. Long before you were here. Long before I was here too. That's what we get for working on MediaWiki master. It's part of what's expected of you as a MediaWiki developer. If you don't like that, you can go home and kindly let the rest of us get back to work.
Should we decouple the process of deployment from development further? Sure, let's talk about that. But please don't start about how it's not your freaking responsibility for crashing the site when you merge some code.
-Chad
On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo <tylerromeo@gmail.com> wrote:
> If you really think stability is top priority, then you cannot possibly think that the current deployment process is sane.
Every real business process reflects the history, organization, and people involved.
> Right now you are placing the responsibility on the developers to make sure the site is stable, because any change they merge might break production since it is automatically sent out. [...]
> It is the responsibility of the operations team to ensure stability. Having to revert something because that's the only way production will be stable is not a proper workflow.
On the contrary, reverting things (from prod branch) because they destabilized production is normal procedure. Whether that's accomplished by frozen builds that are flexibly rolled to prod once they pass acceptable QA tests (more common in commercial service) and are rolled back out of prod should upgrade-related instability emerge, or reverts out of live prod branches in continuous deployment or similar environments (DevOps-y) depends on the underlying process.
This should not even be a debate. Questionable code in production shouldn't be. How that is accomplished is an implementation detail.
On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo <tylerromeo@gmail.com> wrote:
> If you really think stability is top priority, then you cannot possibly think that the current deployment process is sane.
Developers shouldn't be blocked on deployment or operations. Development is expensive and things will break either way. It's good to assume things will break and:
1. Have a simple way to revert
2. Put tests in for common errors
3. Have post-mortems where information is kept for historical purposes and bugs are created to track action items that come from them
> Right now you are placing the responsibility on the developers to make sure the site is stable, because any change they merge might break production since it is automatically sent out. [...]
Yes! This is a _good_ thing. Developers should feel responsible for what they build. It shouldn't be operations' job to make sure the site is stable for code changes. Things should go more in this direction, in fact.
I'm not totally sure what you mean by "it's automatically sent out", though. Deploys are manual.
> It is the responsibility of the operations team to ensure stability. Having to revert something because that's the only way production will be stable is not a proper workflow.
It's the responsibility of the operations team to ensure stability at the infrastructure level, not at the application level. It's sane to expect to revert things, because things will break no matter what. Mean time to recovery is just as important as, or more important than, mean time between failures.
- Ryan
On Fri, Mar 7, 2014 at 7:04 PM, Ryan Lane rlane32@gmail.com wrote:
> Yes! This is a _good_ thing. Developers should feel responsible for what they build. It shouldn't be operations' job to make sure the site is stable for code changes. Things should go more in this direction, in fact.
If you want to give me root access to the MediaWiki production cluster, then I'll start being responsible for the stability of the site.
Tell me something, what about the developers of MariaDB? Should they be responsible for WMF's stability? If they accidentally release a buggy version, are they expected to revert it within hours so that the WMF operations team can redeploy? Or will the operations team actually test new releases first, and refuse to update until things start working again?
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
https://en.wikipedia.org/wiki/Badger
On Mar 8, 2014, at 1:38 PM, Tyler Romeo <tylerromeo@gmail.com> wrote:
> Tell me something, what about the developers of MariaDB? Should they be responsible for WMF's stability? If they accidentally release a buggy version, are they expected to revert it within hours so that the WMF operations team can redeploy? [...]
--- Brandon Harris, Senior Designer, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
New rule: calling badger is synonymous with asking for the moderation bit for yourself.
-Chad
On Mar 8, 2014 1:42 PM, "Brandon Harris" <bharris@wikimedia.org> wrote:
> https://en.wikipedia.org/wiki/Badger
Chad wrote:
> New rule: calling badger is synonymous with asking for the moderation bit for yourself.
Fine, but first we have to ban the people who top-post and don't trim unnecessary parts of their e-mails. :-)
I personally read the badger link as a politer(?) version of calling someone a troll.
Given the tangent, I may as well note that, amusingly, https://meta.wikimedia.org/wiki/Badger contains a large red box that reads "STROM IS A GAY FAG". Perhaps we should do something about that. One day.
MZMcBride
On Sat, Mar 8, 2014 at 9:34 PM, MZMcBride z@mzmcbride.com wrote:
> Chad wrote:
>> New rule: calling badger is synonymous with asking for the moderation bit for yourself.
> Fine, but first we have to ban the people who top-post and don't trim unnecessary parts of their e-mails. :-)
> I personally read the badger link as a politer(?) version of calling someone a troll.
It's actually something used on WMF staffer lists and it's supposed to be a cute way of telling people a thread is getting out of hand. Some people use it that way, and I feel Brandon was in this case, but it's most often used to shut down conversations that people find unpleasant or controversial and its actual effect is usually to enrage those it's used against.
I'd wager it has an overwhelmingly negative effect on communication (and WMF's internal politics likely prove that) and I'm with Chad in saying anyone that uses it here should be moderated. This is of course off topic, so I'll leave this at that.
- Ryan
On Sat, Mar 8, 2014 at 1:38 PM, Tyler Romeo tylerromeo@gmail.com wrote:
> On Fri, Mar 7, 2014 at 7:04 PM, Ryan Lane <rlane32@gmail.com> wrote:
>> Yes! This is a _good_ thing. Developers should feel responsible for what they build. [...]
> If you want to give me root access to the MediaWiki production cluster, then I'll start being responsible for the stability of the site.
You don't work for WMF, so you personally have no responsibility for the stability of the site. It's the WMF developers who have a responsibility to ensure code they're pushing out won't break the site. As an aside, root isn't necessary to maintain MediaWiki on WMF's cluster ;).
> Tell me something, what about the developers of MariaDB? Should they be responsible for WMF's stability? If they accidentally release a buggy version, are they expected to revert it within hours so that the WMF operations team can redeploy? Or will the operations team actually test new releases first, and refuse to update until things start working again?
You're confused about the role of operations and development, especially with respect to organizations that let developers deploy and maintain the applications they develop.
- Ryan
On Sat, Mar 8, 2014 at 2:06 PM, Ryan Lane <rlane32@gmail.com> wrote:
> You don't work for WMF, so you personally have no responsibility for the stability of the site. It's the WMF developers who have a responsibility to ensure code they're pushing out won't break the site. As an aside, root isn't necessary to maintain MediaWiki on WMF's cluster ;).
Of course, I'm really saying that everyone who has access to deploy has this responsibility.
- Ryan
On 03/08/2014 05:10 PM, Ryan Lane wrote:
> You don't work for WMF, so you personally have no responsibility for the stability of the site. [...]
> Of course, I'm really saying that everyone who has access to deploy has this responsibility.
I would amend that to "everyone who has access to +2"; ideally, even everyone who posts a patch to Gerrit should try to avoid breaking master (and thus WMF production, since we have a tight release cycle).
We're not the only project with the goal that "Master should always work".
Automated tests help with this, but will never go all the way, and as you said, there will always be oversights, so quick reverts are also important.
Matt Flaschen
On 03/08/2014 04:38 PM, Tyler Romeo wrote:
> Tell me something, what about the developers of MariaDB? Should they be responsible for WMF's stability? If they accidentally release a buggy version, are they expected to revert it within hours so that the WMF operations team can redeploy?
I'm going to pretend for a minute that this question was posed in earnest, and not as hyperbole.
The answer is: no, obviously not. And for that reason the MariaDB developers are not allowed to simply push their latest code on our infrastructure with a simple +2 to code review.
Notice the difference?
-- Marc
On Sat, Mar 8, 2014 at 8:38 PM, Marc A. Pelletier marc@uberbox.org wrote:
> The answer is: no, obviously not. And for that reason the MariaDB developers are not allowed to simply push their latest code on our infrastructure with a simple +2 to code review.
Yes, and my point is that MediaWiki developers shouldn't be able to do that either! Receiving a +2 should not be the only line between a patch and deployment. Changes should be tested *before* deployment. Nobody is saying that developers are not responsible for writing good and working code, but there need to be guards against things like this happening.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Sat, Mar 8, 2014 at 6:21 PM, Tyler Romeo tylerromeo@gmail.com wrote:
> Yes, and my point is that MediaWiki developers shouldn't be able to do that either! Receiving a +2 should not be the only line between a patch and deployment. Changes should be tested *before* deployment. [...]
Wikimedia uses deployment branches. Just because someone +2s/merges into master doesn't mean it immediately shows up on Wikimedia servers. It needs to go into a deployment branch, and then it needs to get deployed by a person. Also, we use a gating model, so tests are required to pass before something is merged. I believe there are some essential tests that take too long to run and so aren't gating, but the situation isn't as dire as you claim.
- Ryan
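To make the branch mechanics concrete, here is a rough sketch of that flow; the SHA is a placeholder and the branch name is only illustrative of Wikimedia's wmf/* deployment-branch naming. Merging to master, on its own, changes nothing in production:

    import subprocess

    def sh(*cmd):
        """Run a git command, raising on failure so a broken step stops the process."""
        subprocess.run(cmd, check=True)

    FIX = "cafebabe"                 # placeholder: SHA already merged to master
    DEPLOY_BRANCH = "wmf/1.23wmf17"  # illustrative deployment branch name

    # The change lands in master first; someone then cherry-picks it to the
    # deployment branch, and only a manual deploy by a person puts it on the
    # cluster.
    sh("git", "checkout", DEPLOY_BRANCH)
    sh("git", "cherry-pick", "-x", FIX)  # -x records the master SHA in the message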
On Sat, Mar 8, 2014 at 9:48 PM, Ryan Lane rlane32@gmail.com wrote:
> Wikimedia uses deployment branches. Just because someone +2s/merges into master doesn't mean it immediately shows up on Wikimedia servers. It needs to go into a deployment branch, and then it needs to get deployed by a person. [...]
OK, then how did this change get deployed if it "broke" tests?
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Sat, Mar 8, 2014 at 7:05 PM, Tyler Romeo tylerromeo@gmail.com wrote:
> On Sat, Mar 8, 2014 at 9:48 PM, Ryan Lane <rlane32@gmail.com> wrote:
>> Wikimedia uses deployment branches. Just because someone +2s/merges into master doesn't mean it immediately shows up on Wikimedia servers. [...]
> OK, then how did this change get deployed if it "broke" tests?
The Jenkins report says it passed tests, which is why it was deployed. If there are other tests that aren't reporting to Gerrit, or a test that needs to be added, maybe that's a post-mortem action to track?
- Ryan
On Sat, Mar 8, 2014 at 10:15 PM, Ryan Lane rlane32@gmail.com wrote:
> The Jenkins report says it passed tests, which is why it was deployed. If there are other tests that aren't reporting to Gerrit, or a test that needs to be added, maybe that's a post-mortem action to track?
Yep, hence why I think we should work on something like the ideas mentioned in this thread. Maybe we should be running the mobile tests before deployment or something. I'm not sure what the exact solution is, but I think that would be a step in the right direction.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Mar 9, 2014 4:15 AM, "Ryan Lane" rlane32@gmail.com wrote:
> The Jenkins report says it passed tests, which is why it was deployed. If there are other tests that aren't reporting to Gerrit, or a test that needs to be added, maybe that's a post-mortem action to track?
> - Ryan
If the test infrastructure can't handle running every test on every commit (why can't it, by the way, and could that be remedied?), would it be possible and desirable to run a more elaborate test suite with all available tests as a gate before cutting the production branch?
--Martijn
On 10/03/2014 08:51, Martijn Hoekstra wrote:
> If the test infrastructure can't handle running every test on every commit, would it be possible and desirable to run a more elaborate test suite with all available tests as a gate before cutting the production branch?
Which is more or less what I would like us to achieve, i.e. on each change made to core, run the browser tests of the most-used extensions we have, making sure they do not break. That is heavy, but we have a scalable infrastructure to do it.
The priority for now is the beta cluster staging environment, on which we run browser tests. It has already saved our a**** a bunch of times, and those tests are faster to run than running every test on every change.
I still have to digest this huge thread, might end up writing a summary of what is needed on the continuous integration front.
For the curious, we are using a gate developed by OpenStack which basically plugs Gerrit and Jenkins together and supports running tests across multiple repositories. Some high-level documentation is at their upstream site: http://ci.openstack.org/zuul/gating.html
On Sat, Mar 8, 2014 at 8:05 PM, Tyler Romeo tylerromeo@gmail.com wrote:
On Sat, Mar 8, 2014 at 9:48 PM, Ryan Lane rlane32@gmail.com wrote:
OK, then how did this change get deployed if it "broke" tests?
The problem is not that the change broke tests. The problem is that the change broke account creation for Mobile view on a production wiki.
Let me repeat: the change broke account creation on a production wiki.
It was the tests that let us know that account creation was broken. That is what good tests are supposed to do. -Chris
On Mon, Mar 10, 2014 at 11:50 AM, Chris McMahon cmcmahon@wikimedia.org wrote:
The problem is not that the change broke tests. The problem is that the change broke account creation for Mobile view on a production wiki.
Let me repeat: the change broke account creation on a production wiki.
It's been repeated multiple times, but I'll say it again: it is disputed whether account creation was "broken". It is just a question of design and user experience. No functionality was actually broken.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Mar 10, 2014, at 10:59 AM, Tyler Romeo tylerromeo@gmail.com wrote:
It's been repeated multiple times, but I'll say it again: it is disputed whether account creation was "broken". It is just a question of design and user experience. No functionality was actually broken.
This is a fairly limited view. The functionality was *broken*. It failed to work in the way it was expected to work. That’s what “broken” means.
--- Brandon Harris, Senior Designer, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
On Mon, Mar 10, 2014 at 2:01 PM, Brandon Harris bharris@wikimedia.org wrote:
This is a fairly limited view. The functionality was *broken*. It failed to work in the way it was expected to work. That’s what “broken” means.
I'm not going to bother repeating myself. I recommend re-reading this thread for an explanation of how it is disputed whether this patch broke anything.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
As a rule, in industry practice, developers don't get to redefine expected functionality to avoid something being a bug.
Communication gaps about what the expected functionality was are to some extent unavoidable; some bugs slip through that crack. But if both the tests and the users would have complained, it is a bug, regardless of what reasonable developer expectations were.
Yes, it sucks. But this is what having real users (versus idealized ones) brings...
-george william herbert george.herbert@gmail.com
Sent from Kangphone
On Mar 10, 2014, at 11:05 AM, Tyler Romeo tylerromeo@gmail.com wrote:
On Mon, Mar 10, 2014 at 2:01 PM, Brandon Harris bharris@wikimedia.org wrote:
This is a fairly limited view. The functionality was *broken*. It failed to work in the way it was expected to work. That’s what “broken” means.
I'm not going to bother repeating myself. I recommend re-reading this thread for an explanation of how it is disputed whether this patch broke anything.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Mon, Mar 10, 2014 at 10:59 AM, Tyler Romeo tylerromeo@gmail.com wrote:
It's been repeated multiple times, but I'll say it again: it is disputed whether account creation was "broken".
It is not disputed. When you get to the end of the account creation process and you do not have an account, that is broken. And if you manage to create an account anyway but the critical informative messages for new users do not appear, that is broken. Again, this is on a production wiki for all mobile users and all non-javascript clients. That is broken.
On 03/10/2014 02:38 PM, Chris McMahon wrote:
On Mon, Mar 10, 2014 at 10:59 AM, Tyler Romeo tylerromeo@gmail.com wrote:
It's been repeated multiple times, but I'll say it again: it is disputed whether account creation was "broken".
It is not disputed. When you get to the end of the account creation process and you do not have an account, that is broken. And if you manage to create an account anyway but the critical informative messages for new users do not appear, that is broken. Again, this is on a production wiki for all mobile users and all non-javascript clients. That is broken.
It was not people reaching the end of the process without an account. It was one extra step in the process. I realize extra steps matter a lot on mobile, but these are still not the same thing.
Matt Flaschen
2014-03-08 0:39 GMT+02:00 George Herbert george.herbert@gmail.com:
This is not disrespecting development, which is extremely important by any measure. But we're running a top-10 worldwide website, a key worldwide information resource for humanity as a whole. We cannot cripple development to try and maximize stability, but stability has to be priority one. Any large website's teams will have the same attitude.
Please do not forget the contributors who want to improve MediaWiki for their own needs. We also have to balance how much we inconvenience them to meet the requirements of WMF. In my opinion, the balance is already in favor of WMF.
-Niklas
On 08/03/14 09:15, Niklas Laxström wrote:
2014-03-08 0:39 GMT+02:00 George Herbert george.herbert@gmail.com:
This is not disrespecting development, which is extremely important by any measure. But we're running a top-10 worldwide website, a key worldwide information resource for humanity as a whole. We cannot cripple development to try and maximize stability, but stability has to be priority one. Any large website's teams will have the same attitude.
Please do not forget the contributors who want to improve MediaWiki for their own needs. We also have to balance how much we inconvenience them to meet the requirements of WMF. In my opinion, the balance is already in favor of WMF.
-Niklas
This.
MediaWiki serves many interests, and it's already hard enough for third-party users to contribute their features/improvements upstream that most simply don't even try. We should be trying to improve this, not make it even worse for them, because these are often things that would prove widely useful even if they are not /currently/ WMF priorities (it is not uncommon for them to later become such and then have to be completely reimplemented).
There is always a balance to strike, and what is reasonable for developers to worry about or think of will never cover everything, even if they are specifically thinking about Wikimedia projects. But not everyone is, and this should not be a blocker on top of everything else.
-I
On 03/06/2014 09:58 PM, Tyler Romeo wrote:
I don't think you see the problem here. Consider this case as an example (I agree that this is case-by-case, so let's limit the scope to this one). You're forgetting that the original patch fixes a bug. In fact, it fixes a pretty serious UX bug in my opinion (and in the opinion of many others who supported merging this patch).
It's a UX deficiency that some users care about (I wanted 'lower case' and it gave me 'Lower_case') and that others don't care about, or care about much less (https://bugzilla.wikimedia.org/show_bug.cgi?id=61416).
It's not a straightforward bug, but a series of design tradeoffs.
It is a clear issue that it caused a MobileFrontend test failure (and the associated negative UX impact). In an ideal world changes should never cause test failures (either change the test or don't merge the code), but in this case it's understandable given the status quo, since the test was in a different repo and did not run pre-merge.
So to summarize, #3 is obviously not an option. For #2, are we supposed to block core development, and let this bug persist indefinitely, because of a much less serious bug in an extension?
No one's saying 'indefinitely', just to come to a consensus and test with mobile.
Matt Flaschen
<quote name="Chris McMahon" date="2014-03-06" time="17:55:48 -0700">
On Thu, Mar 6, 2014 at 5:49 PM, OQ overlordq@gmail.com wrote:
So I'm confused on the timeline here.
Did the commit get merged before the testsuite found the breakage, or did the commit get merged despite the testsuite failing?
The commit was merged late Wednesday. The automated tests that demonstrated the problem failed over Wednesday night and we analyzed the failures early Thursday morning, which is routine.
And to clarify further: not all tests run right away; some are more expensive and run on a schedule. These are part of that group.
Should some of these tests be moved up to run immediately? Yeah, but we'd need to define the set of 'smoke tests' (tests that just test basic functionality, quickly) because we can't run all the tests all the time.
Greg
For the record, the test that alerted us to this issue was the following: https://wmf.ci.cloudbees.com/job/MobileFrontend-en.m.wikipedia.beta.wmflabs....
Two problems here: tests run only for the extension the code touches, and currently browser tests run only after merge, as they are slow and people would be annoyed if code took 20 minutes to merge. We've had issues in the past where changes in VisualEditor have broken things in MobileFrontend, triggering test failures. Maybe core should run all browser tests for all deployed extensions as part of the merge process to avoid this?
"If MobileFrontend is so tightly coupled with the desktop login form, that is a problem with MobileFrontend." I don't really understand this. I get people moaning at me all the time that mobile has its own version of Watchlist compared to core, has its own skin etc etc and I get told the opposite that we should be closer to core. This is what we are striving for but we are not quite there yet.
"It seems that commenters here believe that the patch made it impossible to create an account if JavaScript was disabled, or via MobileFrontend - this is obviously not true, it just required an additional confirmation" I've not sure anyone has said it was impossible to create an account but user experience was badly effected as you point out. The statement I made was "Since most mobile device input fields default to lowercase ... pretty much anyone who now tries to register an account on mobile will see a warning that their username has been capitalized and will have to fill in the registration form again" [1]
[1] http://lists.wikimedia.org/pipermail/mobile-l/2014-March/006557.html
On Thu, Mar 6, 2014 at 5:13 PM, Greg Grossmeier greg@wikimedia.org wrote:
<quote name="Chris McMahon" date="2014-03-06" time="17:55:48 -0700"> > On Thu, Mar 6, 2014 at 5:49 PM, OQ <overlordq@gmail.com> wrote: > > > So I'm confused on the timeline here. > > > > Did the commit get merged before the testsuite found the breakage, or did > > the commit get merged despite the testsuite failing? > > > The commit was merged late Wednesday. The automated tests that > demonstrated the problem failed over Wednesday night and we analyzed the > failures early Thursday morning, which is routine.
And to clarify further: not all tests run right away; some are more expensive and run on a schedule. These are part of that group.
Should some of these tests be moved up to run immediately? Yeah, but we'd need to define the set of 'smoke tests' (tests that just test basic functionality, quickly) because we can't run all the tests all the time.
Greg
--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |
On 07/03/2014 02:29, Jon Robson wrote:
Two problems here: tests run only for the extension the code touches, and currently browser tests run only after merge, as they are slow and people would be annoyed if code took 20 minutes to merge. We've had issues in the past where changes in VisualEditor have broken things in MobileFrontend, triggering test failures. Maybe core should run all browser tests for all deployed extensions as part of the merge process to avoid this?
Hello,
Indeed, we need more integration tests when gating changes in mediawiki/core. At a minimum: run unit tests from all the extensions which are deployed. It would be nice to run the QUnit tests as well.
Right now core changes only run core tests, and extension changes do not even run the core ones.
As for changes taking 2 hours to be merged, I don't think it is a problem as long as we can skip the gate entirely in case of emergency, which we could do by adding a live hack on the site :-]
And we need integration tests for the wmf branches!
On 03/06/2014 08:29 PM, Jon Robson wrote:
For the record, the test that alerted us to this issue was the following: https://wmf.ci.cloudbees.com/job/MobileFrontend-en.m.wikipedia.beta.wmflabs....
Two problems here: tests run only for the extension the code touches, and currently browser tests run only after merge, as they are slow and people would be annoyed if code took 20 minutes to merge.
It doesn't really affect this case (since the change that impacted mobile was not in the MobileFrontend repo), but that's really not long at all.
I would consider 20 minutes well worth the wait to have voting pre-merge browser tests.
Matt Flaschen
<quote name="Matthew Flaschen" date="2014-03-07" time="16:53:18 -0500">
On 03/06/2014 08:29 PM, Jon Robson wrote:
For the record, the test that alerted us to this issue was the following: https://wmf.ci.cloudbees.com/job/MobileFrontend-en.m.wikipedia.beta.wmflabs....
Two problems here: tests run only for the extension the code touches, and currently browser tests run only after merge, as they are slow and people would be annoyed if code took 20 minutes to merge.
It doesn't really affect this case (since the change that impacted mobile was not in the MobileFrontend repo), but that's really not long at all.
I would consider 20 minutes well worth the wait to have voting pre-merge browser tests.
+1
On 07/03/14 11:38, Greg Grossmeier wrote:
The suite of automatic tests caught this bug, actually. It's how the mobile team found out about it as they got to work this morning. So the testing is quite robust.
As I understand it, there was no bug, it was just a controversial design change. That's not what automated testing is supposed to prevent. If Bartosz had known about the Cucumber test failure, he might have patched it to pass, in a change to be merged simultaneously.
-- Tim Starling
On Thu, Mar 6, 2014 at 6:34 PM, Brion Vibber bvibber@wikimedia.org wrote:
Is there anything specific in the communications involved that you found was problematic, other than a failure to include a backlink in the initial revert?
I think this entire thing was a big failure in basic software development and systems administration. If MobileFrontend is so tightly coupled with the desktop login form, that is a problem with MobileFrontend. In addition, the fact that a practically random code change was launched into production an hour later without so much as a test... That's the kind of thing that gets people fired at other companies.
But apparently I'm the only person that thinks this, so the WMF can feel free to do what it wants.
*-- * *Tyler Romeo* Stevens Institute of Technology, Class of 2016 Major in Computer Science
On Thu, Mar 6, 2014 at 3:54 PM, Tyler Romeo tylerromeo@gmail.com wrote:
I think this entire thing was a big failure in basic software development and systems administration. If MobileFrontend is so tightly coupled with the desktop login form, that is a problem with MobileFrontend. In addition, the fact that a practically random code change was launched into production an hour later without so much as a test... That's the kind of thing that gets people fired at other companies.
At shitty companies maybe.
Things break. You do a post-mortem, track the things that led to the outage, and try to make sure it doesn't break again, ideally by adding automated tests, if possible.
- Ryan
On Thu, Mar 6, 2014 at 3:54 PM, Tyler Romeo tylerromeo@gmail.com wrote:
On Thu, Mar 6, 2014 at 6:34 PM, Brion Vibber bvibber@wikimedia.org wrote:
Is there anything specific in the communications involved that you found was problematic, other than a failure to include a backlink in the initial revert?
I think this entire thing was a big failure in basic software development and systems administration. If MobileFrontend is so tightly coupled with the desktop login form, that is a problem with MobileFrontend.
As noted already in the thread, the commit was broken for non-JS users as well as for mobile; it's not a deficiency in MobileFrontend specifically.
In addition, the fact that a practically random code change was launched into production an hour later without so much as a test...
Aha -- this seems to strike to the heart of the matter. Would you agree this incident has more to do with problems with the branch deployment scheduling than with commit warring?
-- brion
That's the kind of thing that gets people fired at other companies.
But apparently I'm the only person that thinks this, so the WMF can feel free to do what it wants.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Thu, Mar 6, 2014 at 4:05 PM, Brion Vibber bvibber@wikimedia.org wrote:
On Thu, Mar 6, 2014 at 3:54 PM, Tyler Romeo tylerromeo@gmail.com wrote:
On Thu, Mar 6, 2014 at 6:34 PM, Brion Vibber bvibber@wikimedia.org wrote:
Is there anything specific in the communications involved that you found was problematic, other than a failure to include a backlink in the initial revert?
I think this entire thing was a big failure in basic software development and systems administration. If MobileFrontend is so tightly coupled with the desktop login form, that is a problem with MobileFrontend.
As noted already in the thread, the commit was broken for non-JS users as well as for mobile; it's not a deficiency in MobileFrontend specifically.
In addition, the fact that a practically random code change was launched into production an hour later without so much as a test...
Aha -- this seems to strike to the heart of the matter. Would you agree this incident has more to do with problems with the branch deployment scheduling than with commit warring?
-- brion
Does core have any policies related to merging? The core features team has adopted a methodology (although a slightly different one) that we learned of from the VE team. Essentially, +2 in the 24 hours before a deployment branch is cut is limited to fixes for bugs introduced since the last deployment branch was cut, or to reverts of patches that turned out not to be ready for deployment. Core is certainly bigger and has more participants, but perhaps a conversation about when to +2 and how that affects the deployment process would be beneficial?
That's the kind of thing that gets people fired at other companies.
But apparently I'm the only person that thinks this, so the WMF can feel free to do what it wants.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
On Thu, Mar 6, 2014 at 4:08 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
Does core have any policies related to merging? The core features team has adopted a methodology (although a slightly different one) that we learned of from the VE team. Essentially, +2 in the 24 hours before a deployment branch is cut is limited to fixes for bugs introduced since the last deployment branch was cut, or to reverts of patches that turned out not to be ready for deployment. Core is certainly bigger and has more participants, but perhaps a conversation about when to +2 and how that affects the deployment process would be beneficial?
Formally, no (not that I know of). Informally, I know a lot of us do a lot of merging on Fridays, partly for this reason. I resisted merging a big patch this morning because I want it to sit in beta for a while. I know a few patches were merged this morning so that they *would* make it into today's deploy. Everyone with +2 should always think about how/when things will be deployed, and merge as appropriate. And it seems like most people use good judgement most of the time.
If this is coming up a lot, then yeah, let's make some policy about it, or just enforce it in how we do the branch cutting.
On Thu, Mar 6, 2014 at 7:08 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
Does core have any policies related to merging? The core features team has adopted a methodology (although a slightly different one) that we learned of from the VE team. Essentially, +2 in the 24 hours before a deployment branch is cut is limited to fixes for bugs introduced since the last deployment branch was cut, or to reverts of patches that turned out not to be ready for deployment. Core is certainly bigger and has more participants, but perhaps a conversation about when to +2 and how that affects the deployment process would be beneficial?
While we're talking about +2s in core, I'd like to ask for special care on parser-related patches. Two issues:
1. Even minor changes can silently affect a large number of rendered pages. We are (slowly) beginning to create tools to search for affected (or deprecated) wikitext so that we can tell before deployment how widespread the changes are. Slow down and ask for the tools to be run to prevent surprises later.
2. We are trying to keep two different parsers in sync: Parsoid and the PHP parser. For any significant change to the PHP parser (and sanitizer, and parser tests), there is likely a change needed in Parsoid (and vice versa). Try to make sure you get +1s from both teams before a patch gets +2ed.
If I had a procedures wishlist, it would be for:
a) a prominent link beside gerrit's +2 where teams could write project-specific +2 guidelines.
b) a gerrit page banner in the "24 hours before deployment" window for a given project; it's easy to lose track of the deployment train, especially if you work on multiple projects.
--scott
On 7 March 2014 at 07:44, C. Scott Ananian cananian@wikimedia.org wrote:
If I had a procedures wishlist, it would be for:
a) a prominent link beside gerrit's +2 where teams could write project-specific +2 guidelines.
b) a gerrit page banner in the "24 hours before deployment" window for a given project; it's easy to lose track of the deployment train, especially if you work on multiple projects.
Strong +1 for a "24 hours before deployment" banner. It would help volunteers like me who don't follow all changes to the deployment train.
Thomas
On Thu, Mar 6, 2014 at 4:54 PM, Tyler Romeo tylerromeo@gmail.com wrote:
On Thu, Mar 6, 2014 at 6:34 PM, Brion Vibber bvibber@wikimedia.org wrote:
Is there anything specific in the communications involved that you found was problematic, other than a failure to include a backlink in the initial revert?
I think this entire thing was a big failure in basic software development and systems administration. If MobileFrontend is so tightly coupled with the desktop login form, that is a problem with MobileFrontend. In addition, the fact that a practically random code change was launched into production an hour later without so much as a test...
It was in fact our automated browser test suite that alerted us that a change to some other area of the software overnight had broken some central MobileFrontend functionality. It was rather unexpected, and we moved quickly to identify the issue and revert it in the short amount of time we had before the code went to production.
That's the kind of thing that gets people fired at other companies.
But apparently I'm the only person that thinks this, so the WMF can feel free to do what it wants.
That sort of thing is not necessary.
-Chris
On Fri, 07 Mar 2014 00:34:40 +0100, Brion Vibber bvibber@wikimedia.org wrote:
For years and years and years we've been very free about reverting things that break. No one, including old-timers like me and Tim, has the "right" to not have something reverted. If it needs to be reverted it will be reverted -- there is nothing personal in a revert. Remember it can always be put back once all problems are resolved.
I missed the SVN period, but I am under the impression that this was largely because "post-merge" code review was used, and that this is not something we should be overly proud of. :)
On 07/03/2014 14:56, Bartosz Dziewoński wrote:
On Fri, 07 Mar 2014 00:34:40 +0100, Brion Vibber bvibber@wikimedia.org wrote:
For years and years and years we've been very free about reverting things that break. No one, including old-timers like me and Tim, has the "right" to not have something reverted. If it needs to be reverted it will be reverted -- there is nothing personal in a revert. Remember it can always be put back once all problems are resolved.
I missed the SVN period, but I am under the impression that this was largely because "post-merge" code review was used, and that this is not something we should be overly proud of. :)
Reverting on the spot because of site breakage is still part of our culture. I have done a few reverts without even thinking about contacting the original author, because what mattered was returning to a known state.
Most of our changes are not urgent and can well wait a couple more days to be refined. Heck, we can even deploy them out of the train if need be!
Hi Tyler,
I understand you're frustrated here. As Jon says: "communication in the wikiverse is hard". Running a top-10 website is also hard.
Others have covered many of the other points, but I wanted to make sure I addressed one of the points that hasn't been covered yet:
On Thu, Mar 6, 2014 at 2:08 PM, Tyler Romeo tylerromeo@gmail.com wrote:
Changes to MediaWiki core should not have to take into account extensions that incorrectly rely on its interface, and a breakage in a deployed extension should result in an undeployment and a fix to that extension, not a revert of the core patch.
I wholeheartedly disagree with this. Changes to core should definitely take into account uses by widely-deployed extensions (where "widely-deployed" can either mean by installation count or by end-user count), even if the usage is "incorrect". We need to handle these things on a case-by-case basis, but in general, *all* of the following are options when a core change introduces an unintentional extension incompatibility:
1. Fix the extension quickly
2. Revert the change
3. Undeploy the extension until it's fixed to be compatible with core
#3 is last and least. It should be exceedingly rare that we undeploy a long-running extension because of a newly-introduced core change.
It is often difficult to know exactly how core "should" be used, and sometimes extension developers need to do things that seem hacky or wrong to achieve the desired result. It will often be the case that we'll need to continue to support misfeatures because breaking them would be too disruptive. Over time, if we improve our practices, this sort of tradeoff will need to happen less and less, but it does need to happen.
Drawing the analogy to wiki world, what has happened here is exactly this: https://en.wikipedia.org/wiki/Wikipedia:BOLD,_revert,_discuss_cycle
We're in the discuss part, which is actually where we should be.
Rob
In over two years at WMF I have never been involved in a discussion like this, but here goes:
In this case, I think it was entirely appropriate to revert immediately and pick up the pieces later. The source of the code is immaterial, if Tim Starling or Brion Vibber had merged this we would have done exactly the same thing.
As Steven noted, the immediate issue was that it created a serious problem with the mobile account creation process. This blocked our ability to test other aspects of mobile account creation and login that have changed recently. And, since this occurred on Thursday morning in the run-up to the weekly deployment, we had little time to prevent this going live to production.
Beyond that, there are serious concerns with any feature that a) requires javascript support in the client in order to create an account on the wiki and b) does not honor the characters that the user types in the username and password fields. I know of at least one historical instance where violating b) caused a significant problem in UniversalLanguageSelector. We prevented the ULS problem from going live to production at the time as well.
-Chris
On 07/03/14 09:07, Chris McMahon wrote:
In over two years at WMF I have never been involved in a discussion like this, but here goes:
In this case, I think it was entirely appropriate to revert immediately and pick up the pieces later. The source of the code is immaterial, if Tim Starling or Brion Vibber had merged this we would have done exactly the same thing.
I would never have merged it, because it had a -1 from Steven Walling, apparently speaking on behalf of others on design-l. I think changes should be made by consensus.
On Gerrit, Bartosz Dziewoński wrote:
This has been silently reverted by Jon in https://gerrit.wikimedia.org/r/#/c/117234/ . I would appreciate a comment being left here, or reverting through gerrit's interface instead of manually, so the cross-link happens automatically. It's not cool to do it like this.
I support this complaint. Please use the "revert" button in Gerrit to revert commits, or add a comment to the reverted change. A post to mobile-l is not sufficient.
-- Tim Starling
On 07/03/14 03:58, Tim Starling wrote:
I would never have merged it, because it had a -1 from Steven Walling, apparently speaking on behalf of others on design-l. I think changes should be made by consensus.
The changeset was the result of the discussion on the Design list. The reason Steven Walling gave for the -1 was simply not true, but attempts to explain this failed and consensus apparently wound up being to ignore him.
Just for a little background/context here.
-I
On Fri, Mar 7, 2014 at 5:23 AM, Isarra Yos zhorishna@gmail.com wrote:
The changeset was the result of the discussion on the Design list. The reason Steven Walling gave for the -1 was simply not true, but attempts to explain this failed and consensus apparently wound up being to ignore him.
Just for a little background/context here
Just for more background/context here... As you can see in the latest comments on bug 61416, I'm not the only one who objects to the implementation as currently designed. Prior to the bug, Erik M. also proposed an alternative solution which was ignored for some reason.
That being said, I +2'd the quick revert because the change made the mobile signup flow a mess. Whether it should have been merged in the first place is another debate.
On 07/03/14 10:16, Steven Walling wrote:
Just for more background/context here... As you can see in the latest comments on bug 61416, I'm not the only one who objects to the implementation as currently designed. Prior to the bug, Erik M. also proposed an alternative solution which was ignored for some reason.
Aside from the blur part, Erik Möller's proposal was exactly what was implemented.
-I
Tim Starling wrote:
I would never have merged it, because it had a -1 from Steven Walling, apparently speaking on behalf of others on design-l. I think changes should be made by consensus.
The change also had five +1s, as noted in this thread. I find it interesting that, to you, consensus now includes a liberum veto. I'm fine with a temporary revert if the change was causing an issue in production, but the overall goal and execution seem fine to me.
As Jon notes at https://bugzilla.wikimedia.org/61416, there may be ways to optimize the overall behavior here in the future (e.g., by implementing a user.user_display_name field or re-using user.user_real_name). But for now, I don't see any substantive issue with this change or with merging it in core. Silently changing a user's preferred username is a valid bug: 'Mzmcbride' plainly looks stupid, and the silent and unexpected change to a user's input is an unfair trick on the user, especially given what a nightmare it is to rename a user account.
MZMcBride
On 07/03/2014 07:55, MZMcBride wrote:
Tim Starling wrote:
I would never have merged it, because it had a -1 from Steven Walling, apparently speaking on behalf of others on design-l. I think changes should be made by consensus.
The change also had five +1s, as noted in this thread. I find it interesting that, to you, consensus now includes a liberum veto. I'm fine with a temporary revert if the change was causing an issue in production, but the overall goal and execution seem fine to me.
X oppose and Y approve is not a consensus. X+Y approve is.
So a single -1 should prevent a change from being submitted until that -1 is lifted, by addressing the person's concern(s), or correcting him/her, or whatever.
On Fri, 07 Mar 2014 16:27:53 +0100, Antoine Musso hashar+wmf@free.fr wrote:
So a single -1 should prevent a change from being submitted until that -1 is lifted, by addressing the person's concern(s), or correcting him/her, or whatever.
Note that such a rule has never been followed by anyone, including by various WMF teams. You can easily find numerous examples, even if these searches are limited to only changesets where the -1 stuck on the last patchset.
https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1,n,z (this search seems to hang forever) https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1+project:med...
On 07/03/2014 16:48, Bartosz Dziewoński wrote:
On Fri, 07 Mar 2014 16:27:53 +0100, Antoine Musso hashar+wmf@free.fr wrote:
So a single -1 should prevent a change from being submitted until that -1 is lifted, by addressing the person's concern(s), or correcting him/her, or whatever.
Note that such a rule has never been followed by anyone, including by various WMF teams. You can easily find numerous examples, even if these searches are limited to only changesets where the -1 stuck on the last patchset.
https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1,n,z (this search seems to hang forever) https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1+project:med...
That is like ~10 changes per year on core, which suggests we do attempt to get the -1 lifted before merging :-D
On 07/03/14 17:06, Antoine Musso wrote:
On 07/03/2014 16:48, Bartosz Dziewoński wrote:
On Fri, 07 Mar 2014 16:27:53 +0100, Antoine Musso hashar+wmf@free.fr wrote:
So a single -1 should prevent a change from being submitted until that -1 is lifted, by addressing the person's concern(s), or correcting him/her, or whatever.
Note that such a rule has never been followed by anyone, including by various WMF teams. You can easily find numerous examples, even if these searches are limited to only changesets where the -1 stuck on the last patchset.
https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1,n,z (this search seems to hang forever) https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1+project:med...
That is like ~10 changes per year on core, which suggests we do attempt to get the -1 lifted before merging :-D
We do attempt to get it lifted, but this is not always feasible. Here, as in other cases, the issues were discussed and a good faith effort was put in to address them (most were indeed resolved), and what more can we really do? Even when a minority opinion cannot convince others that something is bad, they will not necessarily change their minds and agree that it is good, either.
Sometimes it does turn out they are right, but does that mean we should automatically stand down whenever a single person disagrees? They should be able to show that they are right, and if they cannot, will the change itself not show it later, if it comes to that?
Coming to a standstill over these achieves nothing.
-I
Note, in case you didn't see (since this thread was getting extremely complicated to follow): I started two threads, one about "Merging near deployment branch cut time" and one about "Notifying people when integration tests"
Antoine Musso wrote:
Le 07/03/2014 16:48, Bartosz Dziewoński a écrit :
On Fri, 07 Mar 2014 16:27:53 +0100, Antoine Musso hashar+wmf@free.fr wrote:
So a single -1 should prevent a change from being submitted until that -1 is lifted, by addressing the person's concern(s), or correcting him/her, or whatever.
Note that such a rule has never been followed by anyone, including by various WMF teams. You can easily find numerous examples, even if these searches are limited to only changesets where the -1 stuck on the last patchset.
https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1,n,z (this search seems to hang forever)
https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1+project:mediawiki/core,n,z
That is like ~10 changes per year on core, which suggests we do attempt to get the -1 lifted before merging :-D
I believe these queries only display cases where there was an active -1 on the changeset when it was merged. As I understand it, any new patchset (for a rebase or otherwise) clears every previous -1. I imagine you're seeing dramatically underreported results from these queries.
That said, I'm not sure small integers are really what's important here. The feature in question seems sane and needs a few tweaks before being re-merged into core. No big deal. I don't think this is a war. The subject line chosen here was possibly a bit inflammatory. :-)
MZMcBride
On 08/03/14 02:48, Bartosz Dziewoński wrote:
On Fri, 07 Mar 2014 16:27:53 +0100, Antoine Musso hashar+wmf@free.fr wrote:
So a single -1 should prevent a change from being submitted until that -1 is lifted, by addressing the person's concern(s), or correcting him/her, or whatever.
Note that such a rule has never been followed by anyone, including by various WMF teams. You can easily find numerous examples, even if these searches are limited to only changesets where the -1 stuck on the last patchset.
https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1,n,z (this search seems to hang forever) https://gerrit.wikimedia.org/r/#/q/is:merged+label:Code-Review-1+project:med...
I wouldn't go so far as to say nothing with a -1 should ever be merged. I think objections should be resolved when the objecting party is actively involved and competent, and when the objection is substantial.
That is to say, if someone drops a -1 on a change but, despite their objections being answered in a subsequent comment or patchset, has no further involvement, then the change can be merged after some time has elapsed.
And if someone complains about some minor creative element, such as whitespace, function naming, comment punctuation, etc., then that can be resolved without requiring consensus, e.g. by simple majority or by respecting the decision of the original author.
Or if someone is commenting outside their field of expertise, and their objections are wrong as a widely-accepted matter of fact, those objections can be ignored.
But if someone is saying that the entire change is a bad idea, and has some solid reason for saying that, and has renewed their objection as appropriate, then I think that needs to be resolved by further discussion.
-- Tim Starling
On 03/06/2014 05:07 PM, Chris McMahon wrote:
Beyond that, there are serious concerns with any feature that a) requires javascript support in the client in order to create an account on the wiki and b) does not honor the characters that the user types in the username and password fields.
I agree there were issues that could have been solved with better communication and consensus. However, the above is not accurate.
a) JavaScript was not required to create an account. However, it did show a confirmation screen, which required you to type more on mobile (this would have been caught with more testing).
b) This is a long-standing MW issue, not an issue with the recent patch. In fact, it's the reason behind the patch. MW has always converted 'lower case' to 'Lower_case'. The goal of the recent patch was to let the user know early on (and give them a chance to reconsider the username), so they're not surprised after signup.
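To make the behavior concrete, here is a minimal sketch of the canonicalization rule described above. This is an illustration only, not MediaWiki's actual implementation (the real code also validates characters, applies per-language Unicode uppercasing, and so on):

    // Illustrative sketch of the long-standing behavior: whitespace
    // becomes underscores in the stored form, and the first letter
    // is uppercased.
    function canonicalUsername(input: string): string {
        const key = input.trim().replace(/[\s_]+/g, '_');
        return key.charAt(0).toUpperCase() + key.slice(1);
    }

    canonicalUsername('lower case'); // => 'Lower_case', the surprise users hit at signup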
You're right that in addition to the warning, it did the change client-side in a different way (at the last moment, on submit, not while typing), but what they typed was not 100% honored before either.
That goal may not have been accomplished in the best way (and we can even debate whether any UX change is required here), but it is not fair or accurate to blame MW's longstanding username canonicalization on this patch.
Also, the password was only affected in that it was cleared on submit, which applies to all warnings and errors.
Matt Flaschen
Special:Code automatically generated a list of followup revisions for each revision (based on linking/mentioning). It would be nice to have that in Gerrit.
Filed https://bugzilla.wikimedia.org/show_bug.cgi?id=62371 .
It seems that commenters here believe that the patch made it impossible to create an account if JavaScript was disabled, or via MobileFrontend – this is obviously not true, it just required an additional confirmation (which was by design and +1'd by five people). Please stop spreading this disinformation.
Well, this has blown up quite a lot. I'm going to try to summarize and comment on the discussion so far, from the perspective of the "perpetrator" ;). Since we've gone off on quite a few tangents I'll send one e-mail for each one I see.
Regarding the patch itself [1]:
(tl;dr we all messed up, let's fix it)
I admittedly hadn't considered its impact on extensions which might be doing surprising things to core's special pages, and I can understand how it caused issues. The severity of the issues, however, has been somewhat overstated (for example, it did not require JS to create an account, contrary to what Chris said).
Please stop repeating that it was "broken" (it was not; it worked as intended, you're just disagreeing with the design) and that this is not an issue in MobileFrontend (it is, if that patch is merged; MF simply becomes "incompatible" with core master). The issue of the confirmation popping up for lowercase usernames is nowhere near as insurmountable as it was painted to be and could be avoided with a few lines of JavaScript in MF (I even gave Jon a snippet on IRC).
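For the curious, here is a hypothetical sketch of what such a snippet could look like; it is not the snippet from IRC, and the 'wpName2' field id is an assumption about the signup form's markup. The idea: apply the capitalization the server would apply anyway, client-side, before submit, so the confirmation step never triggers for the common "mobile keyboards default to lowercase" case.

    // Hypothetical sketch only; not the actual snippet from IRC.
    // Assumes the signup form's username input has id "wpName2".
    const nameField = document.querySelector<HTMLInputElement>('#wpName2');
    if (nameField && nameField.form) {
        nameField.form.addEventListener('submit', () => {
            const v = nameField.value;
            if (v) {
                // Pre-apply the uppercasing the server would do anyway,
                // so the server-side confirmation never fires.
                nameField.value = v.charAt(0).toUpperCase() + v.slice(1);
            }
        });
    }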
The revert was done with a total lack of communication with the people involved in the patch (there was only a brief e-mail to an unrelated mailing list[2]).
Furthermore, the blanket revert also killed the other part of the patch – live checking for invalid and taken usernames, which was a feature that has been requested for years now and which for some reason was not implemented during several reworkings and redesigns of the signup form.
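As a sketch of how such live checking can work against the standard MediaWiki API (illustrative only, and not necessarily what the reverted patch did): action=query&list=users returns an 'invalid' flag for unusable names and a 'missing' flag for names with no account, which is enough to classify a username as the user types.

    // Sketch: classify a username via the MediaWiki API.
    type NameStatus = 'invalid' | 'available' | 'taken';

    async function checkUsername(name: string): Promise<NameStatus> {
        const url = '/w/api.php?action=query&list=users&format=json&ususers='
            + encodeURIComponent(name);
        const res = await fetch(url);
        const data = await res.json();
        const user = data.query.users[0];
        if ('invalid' in user) return 'invalid';   // not a valid username
        if ('missing' in user) return 'available'; // no such account yet
        return 'taken';                            // an account already exists
    }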
While I still believe that the way this was done is the correct one (seamless warning with JS; additional confirmation step without JS), given the lack of consensus that is apparent now and wasn't apparent earlier (Steven was the only person who commented negatively on the patch) I'll rework it with a less disruptive workflow (seamless warning with JS, as before; a static short plaintext information with no confirmation step or exact warning without JS) and link the resulting patch here. Some useful comments were also left on the changeset and the bug after the revert (primarily by Jon), I'll try to address these as well (some help with the copy would be appreciated).
[1] https://gerrit.wikimedia.org/r/#/c/117234/ [2] http://lists.wikimedia.org/pipermail/mobile-l/2014-March/006557.html
(to be continued: about the deployment and about the browser testing)
On Fri, 07 Mar 2014 14:26:48 +0100, Bartosz Dziewoński matma.rex@gmail.com wrote:
While I still believe that the way this was done is the correct one (seamless warning with JS; additional confirmation step without JS), given the lack of consensus that is apparent now and wasn't apparent earlier (Steven was the only person who commented negatively on the patch) I'll rework it with a less disruptive workflow (seamless warning with JS, as before; a static short plaintext information with no confirmation step or exact warning without JS) and link the resulting patch here. Some useful comments were also left on the changeset and the bug after the revert (primarily by Jon), I'll try to address these as well (some help with the copy would be appreciated).
I have resubmitted this, sans the controversial part, split into two parts, as https://gerrit.wikimedia.org/r/117437 and https://gerrit.wikimedia.org/r/117438 . Humbly requesting some help with the wording of the messages.
(continued, about the browser testing)
(tl;dr where are the tests and how do I know they fail?)
So, we have some slick browser tests. Awesome! But what's not good is that the tests run off-site; the results are not reported back to gerrit, nor Bugzilla (unless someone manually files a bug, usually Chris), nor IRC, nor anywhere else; and they are generally non-discoverable until someone shouts at you for breaking them. (As Tim guessed, I did not know about any failures until Jon told me.)
In fact, I still have no idea what exactly the tests encompass (I've heard about some browser tests for VE because I lurk a lot, never heard of any for core) or where to find them or how to run them. Either I'm slow or we have a serious documentation failure here.
Can something be done about it? Can we have the results reported somewhere visible – preferably to gerrit, as jenkins already reports some post-merge checks there? Or maybe we can have automatically filed bug reports if the build breaks? A bot reporting test status on #wikimedia-dev? Anything?
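For a sense of how little the reporting side would take, here is a hedged sketch that polls a Jenkins job's standard JSON API and produces a status line a bot could relay to IRC or to a bug filer. The job URL is hypothetical; the /lastCompletedBuild/api/json endpoint is standard Jenkins:

    // Hedged sketch: fetch the last completed build of a Jenkins job
    // and format a one-line status report.
    interface JenkinsBuild {
        result: string | null;   // "SUCCESS", "FAILURE", "UNSTABLE", ...
        fullDisplayName: string;
        url: string;
    }

    async function buildStatusLine(jobUrl: string): Promise<string> {
        const res = await fetch(`${jobUrl}/lastCompletedBuild/api/json`);
        const build = (await res.json()) as JenkinsBuild;
        return `${build.fullDisplayName}: ${build.result ?? 'IN PROGRESS'} <${build.url}>`;
    }

    // e.g. buildStatusLine('https://jenkins.example.org/job/browser-tests-core')
    // (job URL is a placeholder, not a real Wikimedia job)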
(I understand that the tests take too long to run them in the pre-merge checks.)
(Jon proposed reverting problematic changes outright, but to me that seems like a bit of an overreaction – bugs in tests and false positives happen, let's not make a huge fuss out of that.)
(to be continued: about the deployment)
<quote name="Bartosz Dz." date="2014-03-07" time="14:36:53 +0100">
In fact, I still have no idea what exactly the tests encompass (I've heard about some browser tests for VE because I lurk a lot, never heard of any for core) or where to find them or how to run them. Either I'm slow or we have a serious documentation failure here.
Can something be done about it? Can we have the results reported somewhere visible – preferably to gerrit, as jenkins already reports some post-merge checks there? Or maybe we can have automatically filed bug reports if the build breaks? A bot reporting test status on #wikimedia-dev? Anything?
Yes. We didn't have the infrastructure to run these tests in-house before; we're now working on rectifying that. Jenkins will (no firm date given for when) do this reporting back.
The current less than 100% helpful results[0] are irc-spammed in #wikimedia-qa, if you want to watch them as they come in.
[0] Less than 100% because they're testing way more than one commit, so you need to put your thinking cap on to determine who broke the test :)
(continued, about the deployment process)
I'm going to start this one with some quotes which summarize the topic better than I would.
On Fri, 07 Mar 2014 00:30:24 +0100, Jon Robson jdlrobson@gmail.com wrote:
The code was merged at the final hour before a deployment train (this is another issue: our deployment train doesn't distinguish between patches that have been sitting on master for a week and patches that have been sitting there for 1 hour). Had this been merged on a Thursday morning we would have had more luxury and a revert maybe could have been avoided (but I still don't think that patch was in a mergeable format).
On Fri, 07 Mar 2014 01:08:54 +0100, Erik Bernhardson ebernhardson@wikimedia.org wrote:
Does core have any policies related to merging? The core features team has adopted a methodology (although a slightly different one) that we learned of from the VE team. Essentially, +2 in the 24 hours before a deployment branch is cut is limited to fixes for bugs introduced since the last deployment branch was cut, or to reverts of patches that turned out not to be ready for deployment. Core is certainly bigger and has more participants, but perhaps a conversation about when to +2 and how that affects the deployment process would be beneficial?
On Fri, 07 Mar 2014 01:31:43 +0100, Chris Steipp csteipp@wikimedia.org wrote:
Formally, no (not that I know of). Informally, I know a lot of us do a lot of merging on Fridays, partly for this reason. I resisted merging a big patch this morning because I want it to sit in beta for a while. I know a few patches were merged this morning so that they *would* make it into today's deploy. Everyone with +2 should always think about how/when things will be deployed, and merge as appropriate. And it seems like most people use good judgement most of the time.
We use continuous integration and we keep the deployed version pretty close to master; everyone used to be glad we did that, and now everyone is commenting on how that's not good. :)
I really do not think that blocking merges into core for 15% of the week is going to do us any good. Core already has a problem with stale patches not being touched (for various reasons, but most stemming from the lack of reviewers) and creating more obstructions is not going to help.
(As a volunteer developer and currently the most active user of CR+2 rights [1], this would seriously interrupt my workflow. I do reviews and merges when I have free time and being forced to adhere to WMF's deployment schedule would impede me.)
If – if – we need additional delay between when patches are merged and when they are deployed, in addition to the ones we already have (semi-beta period on testwikis and mw.org), this should be solved on the technical level, not by a policy – simply branch off from master@{24 hours ago} (or 3 days, or 7 days, whatever) rather than master, or add a delay between a new branch being cut and it being deployed (AFAIK this now happens as instantly as technically possible). Blocking merges to the most active repository just because a certain organization has to do a deployment sounds rather irresponsible to me.
(And that's it from me for now. I'll also reply to some tangenial emails inline.)
On Fri, Mar 7, 2014 at 8:55 AM, Bartosz Dziewoński matma.rex@gmail.com wrote:
I really do not think that blocking merges into core for 15% of the week is going to do us any good. Core already has a problem with stale patches not being touched (for various reasons, but most stemming from the lack of reviewers) and creating more obstructions is not going to help.
I agree. I think a better technical solution would be to halt jenkins' auto-merge for the 24 hour period, so that +2'ed changes are not automatically merged until after the branch is cut. Gerrit already lets you force the merge, bypassing jenkins, so that could be done if the patch in question was deployment-critical. --scott
On Fri, Mar 7, 2014 at 12:08 PM, C. Scott Ananian cananian@wikimedia.org wrote:
I agree. I think a better technical solution would be to halt jenkins' auto-merge for the 24 hour period, so that +2'ed changes are not automatically merged until after the branch is cut.
I don't see how that's any better. Things still aren't getting merged.
If anything, the "cut using master@{24 hours ago}" is a much better idea.[1] Although it might be useful to see if Wednesday tends to be a relatively active bug-fixing day as the community on non-Wikipedia sites finds issues in the version that was deployed to them on Tuesday, in which case keeping those from making it into the new cut on Thursday (and so requiring more backports or waiting an extra week for fixes) might not be so great.
[1]: And yes, 'master@{24 hours ago}' is valid git syntax.
On Fri, Mar 7, 2014 at 12:21 PM, Brad Jorsch (Anomie) bjorsch@wikimedia.org wrote:
On Fri, Mar 7, 2014 at 12:08 PM, C. Scott Ananian cananian@wikimedia.org wrote:
I agree. I think a better technical solution would be to halt jenkins' auto-merge for the 24 hour period, so that +2'ed changes are not automatically merged until after the branch is cut.
I don't see how that's any better. Things still aren't getting merged.
If anything, the "cut using master@{24 hours ago}" is a much better idea.[1] Although it might be useful to see if Wednesday tends to be a relatively active bug-fixing day as the community on non-Wikipedia sites finds issues in the version that was deployed to them on Tuesday, in which case keeping those from making it into the new cut on Thursday (and so requiring more backports or waiting an extra week for fixes) might not be so great.
That makes it harder to apply deployable bug fixes/merges. Now you need to apply them against two different branches, so you've just moved the problem 24 hours back in time.
The benefit of halting auto-merge is that (a) it automatically resumes once the deploy happens, so volunteers don't have to come back to gerrit to repeat their approval, and (b) gerrit already has a convenient "publish and submit" button that can be used to easily merge deployment-relevant patches before the branch is cut. --scott
On Fri, Mar 7, 2014 at 12:37 PM, C. Scott Ananian cananian@wikimedia.org wrote:
The benefit of halting auto-merge is that (a) it automatically resumes once the deploy happens, so volunteers don't have to come back to gerrit to repeat their approval,
Unless it merge-conflicts now where it didn't 24 hours ago.
and (b) gerrit already has a convenient "publish and submit" button that can be used to easily merge deployment-relevant patches before the branch is cut.
Which skips Jenkins running the unit tests.
Also, I believe it's possible to disable "publish and submit" in Gerrit; I think VE and UploadWizard do this, for example.