TL;DR: All webservices running on the grid engine backend in Toolforge
were restarted around 2019-10-18 21:29 UTC. Following the restart,
these jobs should retain their original TMPDIR, and the ability to
write to it, for as long as they run.
Earlier this week Musikanimal commented on a stale ticket [0] about a
mysterious, intermittent "(chunk.c.553) opening temp-file failed: No
such file or directory" error in a particular webservice. A related
bug [1] (since merged into the first as a duplicate) had previously
been investigated in depth by Zhuyifei1999 without reaching a clear
conclusion. I started looking into the problem with little
expectation of finding an answer, but with the hope that I could at
least rule some things out as the "root cause".
I got lucky this time and did figure out a root cause for the
problem. It turns out that Grid Engine creates a unique directory
under /tmp for each job that it starts, named /tmp/{job
number}.{task number}.{queue name}. The job's main process is started
with the TMPDIR environment variable pointing to this unique
directory.
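For illustration (this is my sketch, not code from the affected
webservice), a job can depend on that directory without ever naming
it explicitly, because standard temp-file helpers honor TMPDIR:

    import os
    import tempfile

    # Grid Engine exports TMPDIR as /tmp/{job}.{task}.{queue} for
    # each job it starts. Python's tempfile module honors TMPDIR, so
    # a job's scratch files land in its per-job directory by default.
    job_tmpdir = os.environ.get("TMPDIR", "/tmp")

    # If that directory has been deleted out from under the running
    # job, this raises FileNotFoundError ("No such file or
    # directory").
    with tempfile.NamedTemporaryFile(dir=job_tmpdir) as scratch:
        scratch.write(b"scratch data")

The "(chunk.c.553)" error above is this same failure mode surfacing
from inside lighttpd's temp-file handling.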
Separately, we have a daily cron task which runs on each Grid Engine
exec node that is part of the webgrid-generic or webgrid-lighttpd job
queues and removes files and empty directories under /tmp which have
not been accessed in more than 24 hours. This cleanup task was
deleting the empty TMPDIR of any job which had not written to or read
from it in more than 24 hours. Once I made this connection, the fix
was as simple as configuring the cleanup task to ignore empty
directories matching the TMPDIR naming pattern used by Grid Engine.
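The real change was a small tweak to the cleanup task's
configuration, but the idea reduces to a filter like the following
sketch (Python for illustration; the regular expression and helper
name are mine, not the production code):

    import os
    import re
    import time

    # Grid Engine per-job TMPDIRs look like {job}.{task}.{queue},
    # e.g. 1234567.1.webgrid-lighttpd. (Illustrative pattern, not
    # the exact one deployed.)
    GRID_TMPDIR = re.compile(r"^\d+\.\d+\.\S+$")
    MAX_IDLE_SECONDS = 24 * 60 * 60  # the cleanup's 24 hour threshold

    def should_reap(path):
        """Decide whether an empty /tmp subdirectory may be removed."""
        if GRID_TMPDIR.match(os.path.basename(path)):
            # A per-job TMPDIR stays, even if the job never used it.
            return False
        idle = time.time() - os.stat(path).st_atime
        return idle > MAX_IDLE_SECONDS and not os.listdir(path)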
After the configuration change was deployed, I set up a temporary
webservice to monitor its own TMPDIR to verify that the problem was
indeed fixed. Earlier today that tool crossed the 48-hour worst case
I had calculated (with a daily cleanup run and a 24 hour idle
threshold, a TMPDIR created just after one run would only have been
deleted by the run roughly 48 hours later) with no recurrence of the
error.
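The monitor itself needed nothing fancy; something along these lines
is enough (a hypothetical sketch, not the actual tool's code):

    import os
    import time

    # Watch this job's own TMPDIR and record if it ever disappears.
    tmpdir = os.environ["TMPDIR"]
    while os.path.isdir(tmpdir):
        time.sleep(300)  # re-check every five minutes
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "TMPDIR vanished:", tmpdir)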
With that confirmation of the fix, I decided to restart all of the
webservice jobs running on the grid engine in Toolforge to ensure
that each has a freshly created TMPDIR. This seemed like a better
solution than emailing the cloud-announce list to ask folks to
restart their webservices themselves if they were likely to be
affected.
The process I went through in debugging is documented in detail on
the task [2]. The notes there do not include all of the web searches
I did for various error messages and for documentation of the FOSS
software involved in the webservice, but they do show pretty clearly
that I started out looking in one place and ended up finding a root
cause somewhere completely different. The final analysis also shows
how fixing one problem [3] can unintentionally lead to new ones.
[0]: https://phabricator.wikimedia.org/T217815
[1]: https://phabricator.wikimedia.org/T225966
[2]: https://phabricator.wikimedia.org/T217815#5577987
[3]: https://phabricator.wikimedia.org/T190185
Bryan
--
Bryan Davis              Technical Engagement      Wikimedia Foundation
Principal Software Engineer                               Boise, ID USA
[[m:User:BDavis_(WMF)]]                      irc: bd808