Hi all
A few questions to provoke discussion/share knowledge better:
* Why does the train run Tue, Wed, Thu rather than Mon, Tue, Wed?
* Why do we only have 2 group 1 Wikipedias (Catalan and Hebrew)?
* Should there be a backport window on Friday mornings for certain changes?
Longer spiel:
A few weeks ago a change I made led to a small but noticeable UI regression. The site was perfectly usable, but looked noticeably off. It was in a more obscure part of the UI so we missed it during QA/code review.
Late Wednesday a ticket was filed against Wikimedia Commons, but I only became aware of it late Thursday, when the regression rolled out to English Wikipedia. A village pump discussion was started and several duplicate tickets were created. While the site could still be used, it didn't look great and it degraded the experience of many editors.
Once aware of the problem, the issue was easy to fix. A patch was written on Friday.
I understand Friday backports are possible, but my team tends to use them only as a last resort, for fear of creating more work for fellow maintainers over the weekend. As a result, given the site was still usable, the fix wasn't backported until the first available backport window on Monday. This is unfortunately a regular pattern, particularly for small UI regressions.
We addressed the issue on Monday, but I got feedback from several users that this particular issue took too long to get backported. I mentioned the no-Friday-deploy policy. One user asked me why we don't run the train Monday-Wednesday, and to be honest I wasn't sure. I couldn't find anything on https://wikitech.wikimedia.org/wiki/Deployments/Train.
My team tries to avoid big changes on Mondays, as patches merged on Monday are more likely to have issues: they don't always get time to go through QA with our dedicated QA engineer during the week.
So... why don't we run the train Monday-Wednesday? Having a Thursday buffer, during which we can more comfortably backport fixes for any issues not caught in testing (particularly UI bugs), would be extremely helpful to my team. I don't think we'd lose much by giving up the Monday we currently use to rush in last-minute changes.
Assuming there are good reasons for a Tuesday-Thursday train, I think there is another problem with our deploy process, which is the size of group 1. Given the complexity of our interfaces (several skins, gadgets, multiple special pages, user preferences, multiple extensions, and different user rights), many obscure UI bugs get missed in QA by people who don't use the software every day and so lack a clear mental model of how it looks and behaves. My team mostly works on visible user interface changes, and we rely heavily on Catalan and Hebrew Wikipedia users, our only group 1 Wikipedias, to notice UI errors before they go out to a wider audience. Given the size of those audiences, that often doesn't work, and it's often group 2 wikis that make us aware of issues.
If we are going to keep the existing Tue-Thu train, I think it's essential that we have at least one larger Wikipedia in our group 1 deploy, to give us better protection against UI regressions living over the weekend. My understanding is that, for some reason, this is not a decision release engineering can make, but one that requires an on-wiki RFC by the editors themselves. Is that correct? While I can understand the reluctance of editors to experience bugs, I'd argue that it's better to have a bug for a day than to have it for an entire weekend, and this is definitely something we need to think about more deeply.