Hello,
Today at 9am UTC, I started upgrading Zuul to a version that uses
Gearman to trigger jobs (and fixes a bunch of issues). The upgrade had
to be cancelled, and service was disrupted for about an hour.
The Zuul code upgrade itself went well and was completed in half an hour
at 9:30am UTC.
Timo reported that a VisualEditor testing job was failing because it had
been triggered on an incorrect Jenkins slave. Our setup has several
slaves, each configured for a different purpose, so our jobs are tied to
specific slaves using Jenkins labels.
I thought it might be related to the Gearman plugin in Jenkins, so I
restarted Jenkins. That took roughly 15 minutes and did not solve the
issue.
At 10:05am UTC, Zuul was downgraded and restarted.
At 10:14am UTC, Jenkins completed its restart and service resumed.
The root cause is in the Jenkins Gearman plugin, which triggers jobs on
any registered slave, even if that slave has been set to only run jobs
explicitly tied to it.
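To illustrate the difference, here is a toy Python model of slave
selection. It is only a sketch of the behaviour described above, not the
plugin's actual code: the Slave class, the slave names, the labels and
both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slave:
    """Hypothetical stand-in for a Jenkins slave (not a Jenkins API)."""
    name: str
    labels: frozenset = frozenset()
    exclusive: bool = False  # "only run jobs tied to this slave"


def expected_slaves(job_label, slaves):
    """Slaves a job should be allowed on when exclusivity is honoured."""
    if job_label is None:
        # Untied jobs must stay away from exclusive slaves.
        return [s.name for s in slaves if not s.exclusive]
    # Tied jobs only run on slaves carrying the matching label.
    return [s.name for s in slaves if job_label in s.labels]


def observed_slaves(job_label, slaves):
    """Behaviour seen during the upgrade: any registered slave is eligible."""
    return [s.name for s in slaves]


# Made-up slaves, for illustration only.
SLAVES = [
    Slave("general", frozenset({"general"})),
    Slave("browser-tests", frozenset({"browsertests"}), exclusive=True),
]

print(expected_slaves(None, SLAVES))
print(observed_slaves(None, SLAVES))
```

In this model a job without a label is expected to land only on
"general", whereas the observed behaviour also makes "browser-tests"
eligible.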
The upgrade itself went well since I tested it out several times in labs
and documented all the steps on the wiki:
https://www.mediawiki.org/wiki/Continuous_integration/Zuul/gearman_upgrade
I should have tested our multiple-slave setup in labs, but labs only
has one runner :/
Related server admin log:
https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&diff=…
I will report the issue upstream and find out whether they can fix it
for our setup. Meanwhile, I am postponing the Zuul upgrade
indefinitely :-(
--
Antoine "hashar" Musso