Hi all,
Starting the week of June 8th we'll be transitioning our MediaWiki + Extensions deployment cadence to a shorter/simpler one. This will begin with 1.26wmf9.
New cadence:
* Tuesday: new branch cut, deployed to test wikis
* Wednesday: deployed to non-Wikipedias
* Thursday: deployed to Wikipedias
This is not only a lot simpler to understand ("wait, we deploy twice on Wednesday?") but it also shortens the time to get code to everyone (2 or 3 days from branch cut, depending on how you count).
== Transition ==

Transitions from one cadence to another are hard. Here's how we'll be doing this transition:
Week of June 1st (next week):
* We'll complete the wmf8 rollout on June 3rd
* However, we won't be cutting wmf9 on June 3rd
Week of June 8th (in two weeks):
* We'll begin the new cadence with wmf9 on Tuesday, June 9th
I hope this helps our users and developers get great new features and fixes faster.
Greg
Endnotes:
* The task: https://phabricator.wikimedia.org/T97553
* I'll be updating the relevant documentation before the transition
Is there still the same week gap between version deployments to catch bugs?
On Wed, May 27, 2015 at 4:19 PM, Greg Grossmeier greg@wikimedia.org wrote:
<snip>
<quote name="John" date="2015-05-27" time="16:24:17 -0400">
Is there still the same week gap between version deployments to catch bugs?
No, the time to Wikipedias from branch cut is 2 days. I trust our code review, integration, and testing workflows. If it turns out this is too aggressive, we'll switch back to the previous cadence.
Greg
On 27/05/2015 22:19, Greg Grossmeier wrote:
<snip>
This is not only a lot simpler to understand ("wait, we deploy twice on Wednesday?") but it also shortens the time to get code to everyone (2 or 3 days from branch cut, depending on how you count).
Two days... this is awesome.
<snip>
This is super awesome Greg. Thanks for making this happen. The deployment schedule has always been a huge source of pain for me.
On Wed, May 27, 2015 at 10:19 PM, Greg Grossmeier greg@wikimedia.org wrote:
<snip>
Benito, Grossmeier! "He made the trains run on time" [1]
Tuesday: New branch cut, deployed to test wikis
and mediawiki.org as before, I assume.
[1] Or not, http://www.transportmyths.co.uk/mussolini.htm
<quote name="S Page" date="2015-05-27" time="14:58:16 -0700">
Benito, Grossmeier! "He made the trains run on time" [1]
Tuesday: New branch cut, deployed to test wikis
and mediawiki.org as before, I assume.
right right, I just mentally lump mw.org with "test wikis" ;)
Awesome! This will make many teams very happy since they'll be moving faster.
What's the criteria by which you will evaluate the success of this?
Thanks,
Dan

On 27 May 2015 10:19 pm, "Greg Grossmeier" greg@wikimedia.org wrote:
<snip>
This is strictly a question from an uninvolved observer. Does this schedule provide for sufficient time and real-time/hands-on testing before changes hit the big projects?
An IRC discussion I was following last evening suggested to me that the first deploy (to test wikis and mw.org) probably did not get sufficient hands-on testing/utilization to surface many issues that would be significant on production wikis, which leaves only 24 hours on smaller non-Wikipedia wikis, hoping that any problems will pop up before the code is applied to dewiki, frwiki, and enwiki.
I recognize the challenges in balancing continuous improvement and uptime - but if problems aren't surfaced before they hit wikipedias simply because the changes aren't activated by user actions or the problems aren't reported quickly enough, then it's probably going to make more work at the other end of the chain, with more likelihood that changes will need to be rolled back or patches having to be written on the fly. I have a lot of admiration for all of you who address these unplanned situations (it really is impressive to watch!), but I'd hate to see a lot of people constantly being pulled away from other tasks to problem-solve downtimes on big projects.
Risker/Anne
On 28 May 2015 at 07:51, Dan Garry dgarry@wikimedia.org wrote:
Awesome! This will make many teams very happy since they'll be moving faster.
What's the criteria by which you will evaluate the success of this?
<snip>
I suspect the idea is to lean more on our quality assurance infrastructure, e.g. browser tests, which I fully welcome.
The more developed they become, the less chance of regressions making it into the code, let alone onto our projects.
When I joined 3 years ago we had no quality assurance infrastructure, and now we've got things in a great place. They still need a little fine tuning, but this should help us iron out the kinks by forcing us to rely on them more and push out better code.

On 28 May 2015 3:53 pm, "Risker" risker.wp@gmail.com wrote:
<snip>
On May 28, 2015 4:21 PM, "Jon Robson" jdlrobson@gmail.com wrote:
I suspect the idea is to lean more on our quality assurance infrastructure e.g. browser tests which I fully welcome.
The more developed they become the less chance of regressions making it to code let alone our projects.
When I joined 3 years ago we had no quality assurance infrastructure and now we've got things in a great place. They still need a little fine tuning but this should help us iron out the kinks by forcing us to rely on them more and push out better code.
Significantly better than three years ago, sure. However, I would not use the phrase "great place". There are still significant gaps in our coverage. For browser tests in particular, I'm given to understand that their asynchronous nature and somewhat high false-positive rate mean they are not taken as seriously as they should be.
FWIW, I have on several occasions received reports from users that I broke something (obviously I try to avoid that, but I'm far from perfect). I have never once had a browser test successfully tell me I broke something in advance (albeit that could be because I do backend things, but backend things do affect the front end when they explode). Occasionally the unit tests do, but I would still say there is a lot they don't.
Tl;dr: IMO tests are great, but nowhere near replacing actual testing, at least for now.
That said, I don't think the new deployment schedule will cause any problems, and it is at the very least worth trying.
--bawolff
P.S. Anyone remember writing code in the time between 1.16 and 1.17? You are all spoiled :p
<quote name="Risker" date="2015-05-28" time="09:53:31 -0400">
This is strictly a question from an uninvolved observer. Does this schedule provide for sufficient time and real-time/hands-on testing before changes hit the big projects?
Yes. We still have the Beta Cluster (a production-like environment), which runs all code merged into master within 10 minutes of it being merged.
An IRC discussion I was following last evening suggested to me that the first deploy (to test wikis and mw.org) probably did not get sufficient hands-on testing/utilization to surface many issues that would be significant on production wikis, which means only 24 hours on smaller non-wikipedia wikis, hoping that any problems will pop up before it's applied to dewiki, frwiki and enwiki.
Honestly, that's the wrong perspective to take on that incident yesterday[0]. The issue is one that is hard to identify at low traffic levels (one that only really manifests itself at Wikipedia-scale with Wikipedia-scale caching). There will always be issues like this, unfortunately. The way to mitigate them better is by changing how we bucket requests to new or old versions of the software on production.
Currently we bucket by domain name/project site. This doesn't give us a lot of flexibility in testing new versions at scales that can surface issues without being "everyone". We would need to be able to deploy new versions based on a percentage of overall requests (i.e. 5% of all users to the new version, then 10% of all users, then everyone).
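For illustration only (this is not WMF's deployment tooling; the function and identifiers below are made up), deterministic percentage-based bucketing could look roughly like this:

    import hashlib

    def serve_new_version(stable_id: str, rollout_percent: int) -> bool:
        """Decide whether a request should see the new MediaWiki version.

        Hashing a stable identifier means the same reader always lands in the
        same bucket, so raising rollout_percent from 5 to 10 to 100 only ever
        adds people to the new version; nobody flip-flops per request.
        """
        digest = hashlib.sha1(stable_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100 < rollout_percent

    # e.g. ramp wmf9 from 5% of requests, to 10%, to everyone:
    for percent in (5, 10, 100):
        print(percent, serve_new_version("some-session-or-reader-id", percent))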
Best,
Greg
[0] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150527-Cookie
<quote name="Dan Garry" date="2015-05-28" time="13:51:47 +0200">
Awesome! This will make many teams very happy since they'll be moving faster.
:)
What's the criteria by which you will evaluate the success of this?
1) The above (happier teams).
2) It's going to be hard to measure "success", but it'll be much easier to identify failure. I'll be talking with many PMs from WMF and looking through the SWAT deploys and incidents over the next two or more weeks (and ongoing, of course) to determine whether this has caused any unmitigated pain.
Greg
On 05/27/2015 01:19 PM, Greg Grossmeier wrote:
Hi all,
New cadence: Tuesday: New branch cut, deployed to test wikis Wednesday: deployed to non-wikipedias Thursday: deployed to Wikipedias
This means that if we/users spot a bug once the train hits Wikipedias, or the bug is in an extension like PageTriage which is only used on the English Wikipedia, we have to either rush to make the 4pm SWAT window, deploy on Friday, or wait until Monday; which, from what I remember, is similar to the reasoning from when we moved the train from Thursday to Wednesday.
-- Legoktm
<quote name="Legoktm" date="2015-05-28" time="10:17:19 -0700">
This means that if we/users spot a bug once the train hits Wikipedias, or the bug is in an extension like PageTriage which is only used on the English Wikipedia, we have to: rush to make the 4pm SWAT window, deploy on Friday, or wait until Monday; which from what I remember were similar reasons from when we moved the train from Thursday to Wednesday.
Emergency bug fixes are already OK on Fridays (just not "I want my new feature out").
On Fri, May 29, 2015 at 12:17 AM, Legoktm legoktm.wikipedia@gmail.com wrote:
<snip>
This means that if we/users spot a bug once the train hits Wikipedias, or the bug is in an extension like PageTriage which is only used on the English Wikipedia, we have to: rush to make the 4pm SWAT window, deploy on Friday, or wait until Monday; which from what I remember were similar reasons from when we moved the train from Thursday to Wednesday.
Recent API breakages suggest that this doesn't give enough time for client tests to be run, bugs reported, fixed, and merged.
https://phabricator.wikimedia.org/T96942 was an API bug last month which completely broke pywikibot. All wikis; all use cases.
It was reported by pywikibot devs almost as soon as we detected that the test wikis were failing in our travis-ci tests. It was 12 hours before a MediaWiki API fix was submitted to Gerrit, and it took four additional *days* to get merged. The Phabricator task was marked Unbreak Now! all that time.
This also doesn't give clients sufficient time to work around MediaWiki's wonderful intentional API breakages, e.g. rawcontinue, which completely broke pywikibot and needed a large chunk of code rewritten urgently, both for pywikibot core and for the much older and harder-to-fix pywikibot compat, which is still used as part of processes that wiki communities rely on.
Another example is the action=help rewrite not being backwards compatible. pywikibot wasn't broken, as it only uses the help module for older MW releases; but it wouldn't surprise me if there are clients that were parsing the help text, and they would have been broken.
-- John Vandenberg
<quote name="John Mark Vandenberg" date="2015-05-29" time="01:39:52 +0700">
It was reported by pywikibot devs almost as soon as we detected that the test wikis were failing in our travis-ci tests. It was 12 hours before a MediaWiki API fix was submitted to Gerrit, and it took four additional *days* to get merged. The Phabricator task was marked Unbreak Now! all that time.
Which shows the tooling works, but not the social aspects. The backport process (e.g. SWAT and related things) will improve soon as well, which should address much of this.
Not-a-great-response-but: can you specifically ping me in phabricator (I'm @greg) for issues like that above?
On Fri, May 29, 2015 at 2:07 AM, Greg Grossmeier greg@wikimedia.org wrote:
<quote name="John Mark Vandenberg" date="2015-05-29" time="01:39:52 +0700"> > It was reported by pywikibot devs almost as soon as we detected that > the test wikis were failing in our travis-ci tests. It was 12 hours > before a MediaWiki API fix was submitted to Gerrit, and it took four > additional *days* to get merged. The Phabricator task was marked > Unbreak Now! all that time.
Which shows the tooling works, but not the social aspects. The backport process (eg SWAT and related things) will improve soon as well which should address much of this.
Your tooling depends on pywikibot developers (all volunteers) merging a patch within your branch-deploy cycle, which fires off a Travis-CI build of *pywikibot* unit tests, which runs some tests against test.wikipedia.org and test.wikidata.org? And you're proposing to shorten the window in which all this can happen and useful bug reports can get out.
A little crazy, but OK. The biggest problem with that approach is that Travis-CI is not very reliable: often it is backlogged and tests are not run for days. So I suggest that you arrange to run the pywikibot tests daily (or more regularly) on WMF test/beta servers, along with the unit tests of any other client which is a critical part of processes on the Wikimedia wikis.
Not-a-great-response-but: can you specifically ping me in phabricator (I'm @greg) for issues like that above?
That is a process problem. The MediaWiki ops & devs need to detect & escalate massive API breakages, especially after creating the fix which needs to be code reviewed.
<quote name="John Mark Vandenberg" date="2015-05-29" time="04:11:05 +0700">
<snip>
Your tooling depends on pywikibot developers (all volunteers) merging a patch within your branch-deploy cycle, which fires off a Travis-CI build of *pywikibot* unit tests which runs some tests against test.wikipedia.org and test.wikidata.org ? And your proposing to shorten the window in which all this can happen and get useful bug reports out.
That's not "my" tooling, that's pywikibot's ;). But, the point is, there was a problem identified in your testing that was reported and fix submitted in a reasonable amount of time. The failure to get it merged, however, was the failure.
A little crazy but OK. The biggest problem with that approach is Travis-CI is not very reliable - often they are backlogged and tests are not run for days. So I suggest that you arrange to run the pywikibot tests daily (or more regularly) on WMF test/beta servers, and the unit tests of any other client which is a critical part of processes on the Wikimedia wikis.
I would support having pywikibot use WMF hosted integration testing. Please file a task with your current setup in the #continuous-integration-config project: https://phabricator.wikimedia.org/project/profile/1208/
Not-a-great-response-but: can you specifically ping me in phabricator (I'm @greg) for issues like that above?
That is a process problem. The MediaWiki ops & devs need to detect & escalate massive API breakages, especially after creating the fix which needs to be code reviewed.
Concur.
On May 28, 2015 8:40 PM, "John Mark Vandenberg" jayvdb@gmail.com wrote:
<snip>
Recent API breakages suggest that this doesnt give enough time for client tests to be run, bugs reported, fixed and merged.
https://phabricator.wikimedia.org/T96942 was an API bug last month which completely broke pywikibot. All wikis; all use cases.
It was reported by pywikibot devs almost as soon as we detected that the test wikis were failing in our travis-ci tests. It was 12 hours before a MediaWiki API fix was submitted to Gerrit, and it took four additional *days* to get merged. The Phabricator task was marked Unbreak Now! all that time.
Shouldn't such tests be run against beta wiki, not testwiki?
--bawolff
On 28/05/2015 20:39, John Mark Vandenberg wrote: <snip>
This also doesnt give clients sufficient time to workaround MediaWiki's wonderful intentional API breakages. e.g. raw continue, which completely broke pywikibot and needed a large chunk of code rewritten urgently, both for pywikibot core and the much older and harder to fix pywikibot compat, which is still used as part of processes that wiki communities rely on.
Another example is the action=help rewrite not being backwards compatible. pywikibot wasnt broken, as it only uses the help module for older MW releases; but it wouldnt surprise me if there are clients that were parsing the help text and they would have been broken.
I can't stress enough how important pywikibot is! It covers so many functionalities and use cases that it is an excellent stress test for the API.
Low-hanging fruit would be to run its test suite against beta (which runs the tip of master) on an hourly basis.
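As a rough sketch of what such a scheduled check against beta could look like (the hostname is the Beta Cluster's English Wikipedia domain; the script itself is hypothetical, not an existing pywikibot job):

    import requests

    # The Beta Cluster runs the tip of master, so even a trivial API query here
    # can catch gross API breakage before the train reaches production wikis.
    BETA_API = "http://en.wikipedia.beta.wmflabs.org/w/api.php"

    def beta_smoke_check():
        resp = requests.get(BETA_API, params={
            "action": "query",
            "meta": "siteinfo",
            "format": "json",
        }, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        # Any warnings block is the first hint of an API behaviour change.
        if "warnings" in data:
            raise RuntimeError(data["warnings"])
        return data["query"]["general"]["generator"]  # e.g. "MediaWiki 1.26wmf9"

    if __name__ == "__main__":
        print(beta_smoke_check())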
On Thu, May 28, 2015 at 2:39 PM, John Mark Vandenberg jayvdb@gmail.com wrote:
[T96942 https://phabricator.wikimedia.org/T96942] was reported by pywikibot devs almost as soon as we detected that the test wikis were failing in our travis-ci tests.
At 20:22 in the timezone of the main API developer (me).
It was 12 hours before a MediaWiki API fix was submitted to Gerrit,
09:31, basically first thing in the morning for the main API developer. There's really not much to complain about there.
and it took four additional *days* to get merged.
That part does suck.
This also doesnt give clients sufficient time to workaround MediaWiki's wonderful intentional API breakages. e.g. raw continue, which completely broke pywikibot and needed a large chunk of code rewritten urgently, both for pywikibot core and the much older and harder to fix pywikibot compat, which is still used as part of processes that wiki communities rely on.
The continuation change hasn't actually broken anything yet. It's coming soon, though. Nor should a "large chunk of code" *need* rewriting; just add one parameter to your action=query requests.
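For instance (a minimal sketch using the requests library, not pywikibot's actual code), keeping the old continuation format is one extra parameter on the query:

    import requests

    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": 50,
        "format": "json",
        "rawcontinue": "",  # the one extra parameter: keep the old query-continue format
    }
    data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

    # With rawcontinue set, continuation still arrives in the old shape:
    next_batch = data.get("query-continue", {}).get("allpages")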
Unless pywikibot was treating warnings as errors and that's what broke it? Or you're referring to unit tests rather than actual breakage? But a notice about the warnings was sent to mediawiki-api-announce in September 2014,[1] a bit over a month before the warnings started.[2]
[1]: https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2014-September/... [2]: https://gerrit.wikimedia.org/r/#/c/160222/
Another example is the action=help rewrite not being backwards compatible. pywikibot wasnt broken, as it only uses the help module for older MW releases; but it wouldnt surprise me if there are clients that were parsing the help text and they would have been broken.
Comments on that and other proposed changes were requested on mediawiki-api-announce in July 2014,[3] three months before the change[4] was merged. No concerns were raised at the requested location[5] or on the mediawiki-api mailing list.
[3]: https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2014-July/00006... [4]: https://gerrit.wikimedia.org/r/#/c/160798/ [5]: https://www.mediawiki.org/wiki/API/Architecture_work/Planning#HTMLizing_acti...
I'll echo those nervous about the faster pace for deploys -- although not so nervous as to dig my feet in and yell stop. Mostly my concerns boil down to the fact that the beta environment isn't really a good test for anything other than absolute crashers. Hardly anyone uses the Collection extension on beta, for example.
I'd *really* like to see some effort put into improving beta. In particular, running beta with an up-to-date (but perhaps sanitized) mirror of the main WP databases would ensure that we have a decent number of test cases on beta. We recently spent a couple of hours trying to test Parsoid on beta, only to find out that the behavior we were chasing down was caused by the fact that beta's copies of the enwiki IPA formatting templates (!) were out of date and incomplete. We really need to do better than [[English language]] as far as articles to test on enbeta. --scott
<quote name="C. Scott Ananian" date="2015-05-29" time="13:38:49 -0400">
I'll echo those nervous about the faster pace for deploys -- although not so nervous as to dig my feet in and yell stop. Mostly my concerns boil down to the fact that the beta environment isn't really a good test for anything other than absolute crashers. Hardly anyone uses the Collection extension (for example) on beta, for example.
Sounds like you need more manual testing then? Who is tasked with doing that for your team?
I'd *really* like to see some effort put into improving beta. In particular, running beta with an up-to-date (but sanitized, perhaps) mirror of the main WP databases would ensure that we have a decent amount of test cases on beta. We recently spent a couple of hours trying to test Parsoid on beta, only to find out that the behavior we were chasing down was caused by the fact that beta's copy of the IPA formatting templates on enwiki (!) were out-of-date and incomplete. We really need to do better than [[English language]] as far as articles to test on enbeta.
Yeah, that'd be nice.
See related tasks: https://phabricator.wikimedia.org/T51779 https://phabricator.wikimedia.org/T54382
Greg
On Fri, May 29, 2015 at 9:36 PM, Brad Jorsch (Anomie) bjorsch@wikimedia.org wrote:
On Thu, May 28, 2015 at 2:39 PM, John Mark Vandenberg jayvdb@gmail.com wrote:
[T96942 https://phabricator.wikimedia.org/T96942] was reported by pywikibot devs almost as soon as we detected that the test wikis were failing in our travis-ci tests.
At 20:22 in the timezone of the main API developer (me).
It was 12 hours before a MediaWiki API fix was submitted to Gerrit,
09:31, basically first thing in the morning for the main API developer. There's really not much to complain about there.
In the proposed deploy sequence, 12 hours is a serious amount of time.
So if a shorter deploy process is implemented, we need to find ways to get bug reports to you sooner, and ensure you are not the only one who can notice and fix bugs related to API breakages, etc.
and it took four additional *days* to get merged.
That part does suck.
On a positive note, the API breakage this week was rectified much more quickly, and pywikibot test builds are green again.
https://phabricator.wikimedia.org/T100775 https://travis-ci.org/wikimedia/pywikibot-core/builds/64631025
This also doesnt give clients sufficient time to workaround MediaWiki's wonderful intentional API breakages. e.g. raw continue, which completely broke pywikibot and needed a large chunk of code rewritten urgently, both for pywikibot core and the much older and harder to fix pywikibot compat, which is still used as part of processes that wiki communities rely on.
The continuation change hasn't actually broken anything yet.
Hmm. You still don't appreciate that you actually, really, truly, fair-dinkum broke pywikibot...? Warnings are part of the API, and adding/changing them can break clients. Warnings are one of the more brittle parts of the API.
The impact on pywikibot core users wasn't so apparent, as the pywikibot core devs fixed the problem when it hit the test servers and the fix was merged before it hit production servers. Not all users had pulled the latest pywikibot core code, and they informed us their bots were broken, but as far as I am aware we didn't get any pywikibot core bug reports submitted, because (often after mentioning problems on IRC) their problems disappeared after they ran 'git pull'.
However pywikipedia / compat isn't actively maintained, and it broke badly in production, with some scripts being broken for over a month:
https://phabricator.wikimedia.org/T74667 https://phabricator.wikimedia.org/T74749
pywikibot core is gradually improving its understanding of the API warning system, but it isn't well supported yet. As a result, pywikibot generally reports warnings to the user. IIRC your API warnings system can send multiple distinct warnings as a single string, with each warning separated by only a newline, which is especially nasty for user agents to 'understand'. (But this may only be in older versions of the API; I'm not sure.)
So adding a new warning to the API can result in the same warning appearing many, many times on the user console / logs, and thousands of warnings on the screen send users into panic mode.
I strongly recommend fixing the warning system before using it again as aggressively as was done for rawcontinue. E.g. it would be nice if the API emitted codes for each warning scenario (don't reuse codes for similar scenarios), so we don't need to do string matching to detect & discard expected warnings, and you can i18n those messages without breaking clients. (I think there is already a phab task for this.)
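To illustrate the suggestion (this response shape is purely hypothetical; the API did not emit per-warning codes like this at the time), machine-readable codes would let a client drop expected warnings without any string matching:

    # Hypothetical warnings shape with stable codes (NOT the real API output):
    response = {
        "warnings": {
            "query": [
                {"code": "deprecated-query-continue", "*": "human-readable, i18n-able text"},
            ],
        },
    }

    # A client could then filter purely on the code:
    EXPECTED = {"deprecated-query-continue"}
    unexpected = [w for w in response["warnings"]["query"] if w["code"] not in EXPECTED]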
I also strongly recommend that Wikimedia gets heavily involved in decommissioning pywikibot compat bots on Wikimedia servers, and any other very old unmaintained clients, so that the API can be aggressively updated without breaking the many bots still using compat. pywikibot devs did some initial work with WMF staff at Lyon on this front, and we need to keep that moving ahead.
Unless pywikibot was treating warnings as errors and that's what broke it?
Yes, some of the code was raising an exception when it detected an API warning.
However, another part of the breakage was that the JSON structure of the new rawcontinue warnings was not what was expected. Some pywikipedia / compat code assumed that the presence of warnings implied that there would be a warning related to the module used, e.g. result['warnings']['allpages'], but that didn't exist because this warning was in result['warnings']['query'].
https://gerrit.wikimedia.org/r/#/c/176910/4/wikipedia.py,cm https://gerrit.wikimedia.org/r/#/c/170075/2/wikipedia.py,cm
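Approximately (simplified from the description above, not copied from a real response), the shapes involved were:

    # What the API actually returned: the continuation warning sits under the
    # 'query' module, with the text in the '*' member.
    actual = {"warnings": {"query": {"*": "...continuation format is changing..."}}}

    # What the compat code effectively assumed: a warning keyed by the submodule
    # it had requested, e.g. list=allpages.
    assumed = {"warnings": {"allpages": {"*": "..."}}}

    # So result['warnings']['allpages'] raised a KeyError even though
    # result['warnings'] existed.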
It's coming soon though. Nor should a "large chunk of code" *need* rewriting, just add one parameter to your action=query requests.
I presume you mean we could have just added rawcontinue='' . Our very limited testing at the time (we didn't have much time to fix the bugs before it would hit production), and subsequent testing, indicates that the rawcontinue parameter can be used even with very old versions of MediaWiki; it appears to be silently ignored by earlier versions. If this is true, it would have been good to very explicitly mention in the mediawiki-api-announce message that adding rawcontinue='' is backwards compatible, and back to which version it works.
To be safe, we implemented version detection so that rawcontinue is only used on the MediaWiki versions that require it. That was not trivial, as detecting the version means requesting action=query&meta=siteinfo, and that would cause this warning if rawcontinue wasn't used. Do you notice the loop there? In order to avoid using a feature on the wrong version, we need to use the feature. In case you are interested, the pywikibot core merged change was https://gerrit.wikimedia.org/r/#/c/168529/ (there were two other attempts at solving it which were abandoned).
Thankfully, you eventually reduced the occurrences of those warnings (https://gerrit.wikimedia.org/r/#/c/201193/ ), so that a single simple meta=siteinfo query doesn't cause an unnecessary warning to appear in the result. I am not sure whether we can rely on that, though, as you strenuously assert that meta=siteinfo could require continuation; if so, our site info abstraction layer is probably broken whenever meta=siteinfo actually does require continuation.
As a result of this, I ask that more care be given to changes to the output of siteinfo, as that is how clients detect your version, which is a critical step in negotiating the 'protocol version' used. pywikibot tries to use feature detection when sensible instead of version detection, but that is harder to implement and involves sending more requests to the server to detect what fails, slowing down the bootstrapping sequence for each site.
Or you're referring to unit tests rather than actual breakage?
Unit tests, development processes, test processes, and production usage. Everything broke. Jenkins was -1'ing all patches uploaded to pywikibot's Gerrit because a very simple doctest failed. As a result, a solution needed to be merged before normal development activities could resume. ;-)
https://lists.wikimedia.org/pipermail/pywikipedia-l/2014-October/009140.html
But a notice about the warnings was sent to mediawiki-api-announce in September 2014,[1] a bit over a month before the warnings started.[2]
Yea, we're also to blame for not picking up on this. We are paying a bit more attention to that mailing list now as a result ;-)
It is hard to know what would have worked better, but the 'it's just a warning' mentality feels like the underlying problem. A single notice and a month's delay is probably not enough, and "About a month from now" might have got a better response than "Sometime during the MediaWiki 1.25 development cycle".
IMO Wikimedia should have been doing impact analysis before flicking the switch, i.e. tracking how many Wikimedia wiki clients were switching to the new format to avoid the warnings during that month, and which user agents were not responding. I understand that is happening now before the final switch to 'continue' as default, but it would have been useful to start that process before releasing the warning.
Another example is the action=help rewrite not being backwards compatible. pywikibot wasnt broken, as it only uses the help module for older MW releases; but it wouldnt surprise me if there are clients that were parsing the help text and they would have been broken.
Comments on that and other proposed changes were requested on mediawiki-api-announce in July 2014,[3] three months before the change[4] was merged. No concerns were raised at the requested location[5] or on the mediawiki-api mailing list.
This change didn't affect pywikibot, so I don't know of any actual impact, and there may have been none, but AFAIK there was no appropriate notice of this intentional breakage, which removed functionality. An RFC is great, and in that RFC there was a concern raised, albeit not backed with an actual usage which would be broken. A real notice of the actual breakage is still needed, even if no concerns were raised during the RFC.
My point is not that these changes were catastrophic - tools were updated, the wikis lived on, and a few pywikibot compat users probably switched to pywikibot core as a result of a month long breakage.
My point is that shortening the deployment timeframe reduces the ability of volunteers to detect and raise bugs about breakages, and/or update their client code.
-- John Vandenberg
On May 30, 2015 4:07 AM, "John Mark Vandenberg" jayvdb@gmail.com wrote:
So if a shorter deploy process is implemented, we need to find ways to get bug reports to you sooner,
I think work/life balance is going to continue to prevent me personally from getting things any sooner ;)
and ensure you are not the only one who can notice and fix bugs related to API breakages, etc.
Anyone *can*, just no one really *does* at the moment. The Platform reorg made a team that would have, but then the Engineering reorg changed that plan.
We decided to have the new Reading Infrastructure team take on API maintenance, but at the moment the team is short on developers.
IIRC your API warnings system can send multiple distinct warnings as a single string, with each warning separated by only a new line, which is especially nasty for user agents to 'understand'. (but this may only be in older versions of the API - I'm not sure)
Fixing that is on my todo list.
that there would be a warning related to the module used, e.g. result['warnings']['allpages'] , but that didnt exist because this warning was in result['warnings']['query'].
The query module is used there too ;)