Brion Vibber wrote:
The image fileserver is currently a potential problem, as the application servers use NFS to manipulate files on it. NFS is notoriously temperamental, and if the server goes down it tends to hang for long periods of time, with similarly problematic results.
Improvements to this could include minimizing our contact with the file server (avoid unnecessary reads and checks for file existence; we've got a damn database) and potentially using some more explicit file upload protocol which can fail gracefully.
One fairly simple thing to do would be to reduce the NFS timeout substantially. Currently we use a timeout of 1.4 seconds, then it backs off exponentially for a total of 1.4 + 2.8 + 5.6 = 9.8 seconds. If I'm reading this right, that's in addition to the RPC timeout, whatever that is. I don't know what amane's typical response time is at peak load, but I suspect it's orders of magnitude less than that.
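To make the arithmetic concrete, here is a rough Python sketch of the backoff sum. The function names are made up for illustration, and it assumes the timeout simply doubles on each of three attempts, which is what the figures above imply. It does not model the RPC timeout at all.

```python
def backoff_schedule(initial_timeout, attempts, factor=2.0):
    """Per-attempt timeouts in seconds, assuming the timeout is multiplied
    by `factor` after each failed attempt (an assumption, see above)."""
    return [initial_timeout * factor ** i for i in range(attempts)]


def total_wait(initial_timeout, attempts, factor=2.0):
    """Worst-case time spent waiting before the request is given up on."""
    return sum(backoff_schedule(initial_timeout, attempts, factor))


if __name__ == "__main__":
    # 1.4 is the current value; 0.1 is an arbitrary smaller value for comparison.
    for initial in (1.4, 0.1):
        schedule = backoff_schedule(initial, 3)
        print(initial, schedule, round(total_wait(initial, 3), 3))
```

With three doubling attempts the worst case is always seven times the initial timeout, so cutting the initial value cuts the total hang time in the same proportion.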
The structural problem with NFS is that a timeout is required on every request. There's no global state, so every attempted read incurs the same timeout penalty. I believe this is not a problem with AFS:
http://www.openafs.org/pages/doc/UserGuide/auusg004.htm#HDRWQ17
We do have the same problem with MediaWiki's MySQL, memcached and search access, but at least we have straightforward application-level control over timeouts and retries.
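By application-level control I mean something along these lines. This is only a Python sketch of the general idea, not MediaWiki code: `call_with_retries` and its parameters are hypothetical, and `operation` stands in for whatever database, memcached or search call is being made with its own per-call timeout.

```python
import time


def call_with_retries(operation, attempts=3, initial_delay=0.05, factor=2.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Run operation(), retrying on the listed exceptions.

    The caller chooses how many attempts to make and how long to pause
    between them, so a dead backend costs a bounded, known amount of time.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: fail fast and let the caller degrade gracefully.
                raise
            sleep(delay)
            delay *= factor
```

The point is that the number of attempts and the delays are ordinary parameters the application can tune per backend, which is exactly what we can't easily do for file access over an NFS mount.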
-- Tim Starling