On 17/04/13 22:37, Antoine Musso wrote:
Hello,
I had to stop our continuous integration system a few minutes ago. A
nasty issue is causing the jobs to pile up and become VERY slow to process.
The result is that no jobs are triggered and nothing is reported back in
Gerrit.
I am deploying a change as I write this email. Sorry for the
inconvenience, I will keep you updated.
Jenkins was down for a couple of hours on Wednesday April 17th, from
9pm to 11pm.
== Summary ==
The root cause is that I forgot to deploy a configuration change for the
Zuul daemon (the software that takes patches from Gerrit and triggers
jobs in Jenkins).
The consequence is that on each job, git had to copy the repository
being tested from one disk to another instead of using hardlinks.
That made jobs take a very long time and basically killed Jenkins.
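To illustrate the check that would have caught this, here is a minimal
Python sketch that tells whether two paths sit on the same device, which
is the precondition for hardlinks. The function name same_device is
hypothetical and this is not part of our actual setup; the demo only
probes a throwaway temporary directory.

```python
import os
import tempfile


def same_device(path_a, path_b):
    """Return True when both paths live on the same filesystem device.

    Hardlinks can only be created between paths for which this is True.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Demo on a throwaway directory: a file and its parent directory
# normally share a device.
with tempfile.TemporaryDirectory() as tmp:
    probe = os.path.join(tmp, "probe")
    open(probe, "w").close()
    print(same_device(tmp, probe))
```

In our case the two paths to compare would be the directory holding the
Zuul git repositories and the Jenkins workspace under /srv/ssd/.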
== Chain of events ==
The chain of events (all times are GMT) is roughly:
Friday 12th April:
* Operations team kindly inserted an SSD device in the continuous
integration server. The device has been mounted on /srv/ssd/
* At 3pm I configured Jenkins to use the new workspace.
* At 4pm I prepared Zuul related changes to later migrate its git
repositories to the SSD.
Monday 15th April:
* The CPU usage has been fine most of the weekend.
* 9:20am Restarted Jenkins due to an unrelated bug.
* 10pm Load rises a bit but nothing worrying. I am barely monitoring
the activity while doing conf calls.
Tuesday 16th April:
* 4pm load starts rising.
* 10pm Jenkins is stalled with a huge queue
* around midnight an attempt is made to tweak the Jenkins jobs to use
the replicated Gerrit repositories.
Wednesday 17th April:
* 6:45am the tweak is reverted. The Jenkins jobs must use the Zuul
repositories, which contain the patchset merged on the tip of the
branch. The symptom was that none of the jobs were able to fetch the
refspec they needed.
* During the morning, a few more jobs need to be updated. I
investigate the git slowness and eventually move to something else.
* 8pm Jenkins job queue is full. I track down the actual root cause:
jobs are cloning from the slow disks to the SSD. Since these are different
devices, git has to fully copy the repository instead of using hardlinks.
* 8:18pm I shut down Zuul to stop triggering new jobs and cancel all
pending Jenkins jobs.
* 8:30pm I migrate the Zuul repositories onto the same disk as the
Jenkins jobs. That lets git clone create hardlinks and
dramatically speeds up the cloning process.
* The changes I prepared on Friday 12th get merged and applied on
the server, making Zuul use the new configuration.
* 9:16pm Git repositories have been migrated.
* 9:20pm Jenkins is starting.
* 10pm Jenkins is back up.
* Started migrating all the jobs to use the new git path
* 11pm Jobs migrated. Zuul restarted.
== Lessons learned ==
* I should not have switched Jenkins workspaces on Friday. That should
have been done on Monday together with the Zuul changes.
* git clone can be scary.
* hardlinks do not work from one device to another.
* iotop is a must
* Jenkins needs to start up faster (there is a patch upstream)
* Testing in labs does not catch everything.
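As a rough illustration of the hardlink lesson, the Python sketch below
tries to hardlink a file and falls back to a full copy when the
destination is on another device (the OS reports EXDEV). This is only
loosely analogous to what a local git clone does; link_or_copy is a
hypothetical helper, not git's or Zuul's code.

```python
import errno
import os
import shutil
import tempfile


def link_or_copy(src, dst):
    """Hardlink src to dst; fall back to a full copy across devices.

    Returns "hardlink" or "copy" so the caller can see which path
    was taken. Any error other than EXDEV is re-raised.
    """
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Different devices: the data has to be physically copied,
        # which is the slow path that hurt us.
        shutil.copy2(src, dst)
        return "copy"


# Demo inside one temporary directory, so source and destination
# normally share a device.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "object")
    dst = os.path.join(tmp, "object.clone")
    with open(src, "w") as handle:
        handle.write("payload")
    print(link_or_copy(src, dst))
```

A hardlink only adds a directory entry, so its cost does not depend on
the file size, whereas the copy path reads and writes every byte.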
My apologies for the long disruption tonight :-(
--
Antoine "hashar" Musso