Let me clarify the reasoning for the idea:
We realized that some schema changes (which used to be scheduled like other deployments) no longer take 1 hour: they can take 1 month, running continuously, like https://phabricator.wikimedia.org/T139090 , which affects 3 of our largest tables. Also, they no longer require read-only mode or affect code in any way (unless they are a prerequisite).
On the other hand, a schema change combined with high read or write load from long-running maintenance jobs, such as the updateCollation script (that is just one example), could make lag a worse problem: a single transaction has to store pending changes during its lifetime, and long-running reads can block and create pileups due to metadata locking. We want to avoid those interactions, which have certainly caused infrastructure issues in the past.
So, in summary: regular deployments are mutually exclusive, while long-running maintenance tasks could affect each other. This is a way for me (and others) to have visibility of those potential negative interactions, and to make sure we can coordinate: "You are doing work on enwiki? No problem, we will just run this task for commons". "You need to do an emergency data recovery? I will hold off on this other task that can wait". Even if only DBAs use it, it is already useful for not performing incompatible changes at the same time. But it will be even more useful if everybody uses it!
On Thu, Sep 22, 2016 at 4:27 PM, Alex Monk amonk@wikimedia.org wrote:
I had been assuming that puppetised crons were not really relevant...
On 22 September 2016 at 15:19, Guillaume Lederrey <glederrey@wikimedia.org> wrote:
Hello!
Increasing visibility sounds like a great idea! How far do we want to go in that direction? In particular, I'm thinking of a few of the crons we have for Cirrus. For example, we do have daily crons on terbium that re-generate the suggester indices. Those can run for > 1h.
My understanding is that those kinds of crons should not be considered scripts, but standard working parts of the system. Adding them would probably generate more noise than useful information. Is this a reasonable understanding?
Thanks!
Guillaume
On Wed, Sep 21, 2016 at 12:29 AM, Greg Grossmeier greg@wikimedia.org wrote:
In an effort to reduce surprises and potential mishaps, it is now required to include any long-running tasks in the deployment calendar[0].
"Long-running tasks" include any script run on production 'work machines' such as terbium that lasts longer than ~1 hour. Think: migration and maintenance scripts.
This was discussed and proposed in T144661[1].
Best,
Greg
[0] https://wikitech.wikimedia.org/wiki/Deployments
    Relevant diff: https://wikitech.wikimedia.org/w/index.php?diff=850923&oldid=850244
[1] https://phabricator.wikimedia.org/T144661
--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| Release Team Manager            A18D 1138 8E47 FAC8 1C7D |
Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering
-- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation UTC+2 / CEST
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-- Alex Monk VisualEditor/Editing team https://wikimediafoundation.org/wiki/User:Krenair_(WMF)