Hi folks,
After speaking with a few people, I'd like to check in on the WMF deployment train schedule as a whole and see whether there are ways to optimize it.
(Note: Below, I refer to "test wikis" vs. "production wikis", generously counting mediawiki.org as a test wiki. I realize that our "test wikis", with the exception of the Labs wikis, run on the production cluster.)
== Current practice ==
* On Thursdays we increase the release counter, and deploy the latest release to test wikis and the previous one to Wikipedias.
* On Mondays we deploy the latest release to non-Wikipedias.
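To make the timing concrete, the schedule above can be sketched as a small date calculation. This is only an illustration: `rollout_dates`, its parameter names, and the group labels are made up for this sketch and are not part of any deployment tooling, and the example date is just an arbitrary Thursday.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday(): Monday is 0, so Thursday is 3


def rollout_dates(cut_day, second_stage_offset=4):
    """Return the date on which one release reaches each wiki group.

    cut_day: the Thursday on which the release counter is increased.
    second_stage_offset: days from the cut to the non-Wikipedia deploy
        (4 = Monday, as in current practice; 5 = Tuesday, as in Option C).
    """
    if cut_day.weekday() != THURSDAY:
        raise ValueError("releases are cut on Thursdays")
    return {
        "test wikis": cut_day,
        "non-Wikipedias": cut_day + timedelta(days=second_stage_offset),
        # The release reaches Wikipedias as "the previous one" when the
        # next release is cut, one week later.
        "Wikipedias": cut_day + timedelta(days=7),
    }


if __name__ == "__main__":
    # 2014-06-05 is an arbitrary Thursday chosen for illustration.
    for group, day in rollout_dates(date(2014, 6, 5)).items():
        print(f"{group:15} {day:%a %Y-%m-%d}")
```

Under these assumptions a single release reaches its three audiences on three different days within one week, and only one full working day (Friday) falls between the test-wiki deploy and the first production deploy.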
== Problems with this approach ==
* We only have part of Thursday and all of Friday to resolve issues that surface on the test wikis before the Monday rollout to the first production wikis.
* Having two stages of release also increases the cognitive load on developers trying to understand when their code hits production wikis, which arguably increases the risk that a deploy's negative impact goes unnoticed.
== Advantages of this approach ==
* Commons serves just about enough traffic to sometimes act as a useful canary for performance/scaling issues that would later appear on the Wikipedias.
* Developers have some post-deployment time to fix issues highly specific to the non-Wikipedia wikis (e.g. extensions & gadgets only deployed there) rather than being distracted by firefighting on Wikipedia.
== Some options ==
Option A: Change nothing. I haven't heard from enough folks to know whether the problems above are widely perceived to _be_ problems. If the consensus is that current practice is, for now, the best possible approach, we should obviously stick with it.
Option B: No Monday deploy. This would mean we'd have to improve our testing process to catch issues affecting the non-Wikipedia wikis before they hit production. I personally think getting rid of the Monday deploy could create some _desirable_ pain that would act as a forcing function to improve pre-release test practices, rather than using production wikis to test.
At the same time, we'd have a full week to work out the kinks we find in testing before they hit any production wiki, and could have a more systematic process of backing out changes if needed prior to deployment.
Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have come up in testing before they hit prod. I personally don't think this goes far enough, but it might be a useful tweak to make if Option B seems too problematic.
Are there other ways to optimize, or issues I'm missing or misrepresenting above?
Thanks, Erik