TL;DR: All webservices running on the grid engine backend in Toolforge were restarted around 2019-10-18 21:29 UTC. Following the restart, these jobs should retain the ability to write to their original TMPDIR.
Earlier this week, Musikanimal commented on a stale ticket [0] about a mysterious, intermittent "(chunk.c.553) opening temp-file failed: No such file or directory" error in a particular webservice. A related bug [1] (since merged into the first as a duplicate) had previously been investigated in depth by Zhuyifei1999 without a clear conclusion. I started looking into the problem with little expectation of finding an answer, but with the hope that I could at least rule some things out as the "root cause".
I got lucky this time and did figure out the root cause of the problem. It turns out that Grid Engine creates a unique directory under /tmp for each job that it starts, named /tmp/{job number}.{task number}.{queue name}, and launches the job's main process with the TMPDIR environment variable pointing at that directory. Separately, we have a daily cron task which runs on each Grid Engine exec node that is part of the webgrid-generic or webgrid-lighttpd job queues and removes files and empty directories under /tmp which have not been accessed in more than 24 hours. That cleanup task was deleting the (still empty) TMPDIR of any job which had not written to or read from it in more than 24 hours. Once I made that connection, the fix was as simple as configuring the cleanup task to ignore empty directories matching the TMPDIR naming pattern used by Grid Engine.
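To make the idea concrete, here is a minimal sketch of the empty-directory half of that cleanup with the Grid Engine exclusion applied. This is not the actual Toolforge cron job (which is managed in Puppet and may use different tooling); only the /tmp location, the 24 hour threshold, and the {job number}.{task number}.{queue name} pattern come from the description above, everything else is assumed for illustration:

  #!/usr/bin/env python3
  # Hypothetical sketch of the /tmp cleanup, skipping Grid Engine
  # per-job TMPDIR directories so idle jobs keep theirs.
  import os
  import re
  import time

  TMP = "/tmp"
  MAX_AGE = 24 * 60 * 60  # 24 hours, per the cleanup policy above
  # Matches Grid Engine job directories: {job number}.{task number}.{queue name}
  GRIDENGINE_TMPDIR = re.compile(r"^\d+\.\d+\..+$")

  now = time.time()
  for name in os.listdir(TMP):
      path = os.path.join(TMP, name)
      if not os.path.isdir(path):
          continue
      if GRIDENGINE_TMPDIR.match(name):
          # Leave job TMPDIRs alone even if empty and idle; the job
          # may still need to write there later in its lifetime.
          continue
      try:
          if not os.listdir(path) and now - os.stat(path).st_atime > MAX_AGE:
              os.rmdir(path)  # empty and unused for more than 24 hours
      except OSError:
          pass  # directory changed or disappeared underneath us

Before the exclusion, any empty, idle job TMPDIR fell through to the removal branch, which is exactly how long-running webservices lost their temp directory.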
After the configuration change was deployed, I set up a temporary webservice that monitored its own TMPDIR so I could verify the fix. Earlier today that tool crossed the 48-hour worst-case runtime I had calculated with no recurrence of the error. With that confirmation, I decided to restart all of the webservice jobs running on the grid engine in Toolforge to ensure that each of them has its TMPDIR created. This seemed like a better solution than just emailing the cloud-announce list to tell folks to restart their webservices if they were likely to be affected.
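The monitor itself was a throwaway, but something along these lines would perform the same check (the interval and output format here are made up for illustration):

  #!/usr/bin/env python3
  # Hypothetical watcher: periodically confirm that the job's TMPDIR
  # still exists and is writable, and log the result.
  import os
  import time

  tmpdir = os.environ.get("TMPDIR", "/tmp")
  while True:
      exists = os.path.isdir(tmpdir)
      writable = exists and os.access(tmpdir, os.W_OK)
      print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} TMPDIR={tmpdir} "
            f"exists={exists} writable={writable}", flush=True)
      time.sleep(600)  # check every 10 minutes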
The process I went through in debugging is well documented on the task [2]. The notes there do not include all of the web searches I did for various error messages and documentation for the FOSS software involved in the webservice, but they do pretty clearly show that I started out looking in one place and ended up finding the root cause somewhere completely different. The final analysis also shows how fixing one problem [3] can unintentionally lead to new ones.
[0]: https://phabricator.wikimedia.org/T217815
[1]: https://phabricator.wikimedia.org/T225966
[2]: https://phabricator.wikimedia.org/T217815#5577987
[3]: https://phabricator.wikimedia.org/T190185
Bryan