Brion Vibber wrote:
The image fileserver is currently a potential problem, as the application
servers use NFS to manipulate files on it. NFS is notoriously temperamental, and
if the server goes down it tends to hang for long periods of time, with
similarly bad results.
Improvements to this could include minimizing our contact with the file server
(avoid unnecessary reads and checks for file existence; we've got a damn
database) and potentially using some more explicit file upload protocol which
can fail gracefully.
One fairly simple thing to do would be to reduce the NFS timeout
substantially. Currently we use an initial timeout of 1.4 seconds, which then
backs off exponentially for a total of 1.4 + 2.8 + 5.6 = 9.8 seconds. If I'm reading
this right, that's in addition to the RPC timeout, whatever that is. I don't
know what amane's typical response time is at peak load, but I suspect it's
orders of magnitude less than that.
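To make the arithmetic above easy to replay with other values, here is a
rough sketch in Python. It works in tenths of a second, which is the unit the
Linux NFS timeo mount option uses. The function name retry_schedule and the
shorter example value of 2 tenths are my own illustrations, not a measured or
recommended setting, and I'm assuming three attempts with plain doubling as
described above.

```python
def retry_schedule(initial_timeo, attempts, factor=2):
    """Per-attempt timeouts in tenths of a second, growing by `factor`.

    Hypothetical helper: it only models the backoff described in this mail,
    it does not read or change any real mount options.
    """
    return [initial_timeo * factor ** i for i in range(attempts)]


# Current behaviour as described above: 1.4 s, then 2.8 s, then 5.6 s.
current = retry_schedule(14, 3)
print(current, sum(current) / 10)   # [14, 28, 56] 9.8

# An illustrative shorter setting (assumed value, not a recommendation).
shorter = retry_schedule(2, 3)
print(shorter, sum(shorter) / 10)   # [2, 4, 8] 1.4
```

Whatever initial value we pick, three attempts with doubling cost seven times
that value in the worst case, so the initial timeout is the number to tune.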
The structural problem with NFS is that a timeout is required on every
request. There's no global state, so every attempted read incurs the same
timeout penalty. I believe this is not a problem with AFS:
http://www.openafs.org/pages/doc/UserGuide/auusg004.htm#HDRWQ17
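The cost difference can be modelled in a few lines. This is only a sketch of
the bookkeeping, assuming a client that remembers a server is down and fails
later requests immediately. It is not a description of how AFS is actually
implemented, and total_stall is a made-up name. Times are in tenths of a
second.

```python
def total_stall(reads, timeout, remember_down):
    """Tenths of a second spent waiting when `reads` requests hit a dead server.

    remember_down=False models the stateless case: every request waits out
    the full timeout. remember_down=True models a client with shared state:
    the first request pays the timeout and the rest fail immediately.
    """
    if reads == 0:
        return 0
    if remember_down:
        return timeout
    return reads * timeout


# 50 file checks against a dead server, with the 9.8 s worst case from above.
print(total_stall(50, 98, remember_down=False))  # 4900 tenths = 490 s
print(total_stall(50, 98, remember_down=True))   # 98 tenths = 9.8 s
```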
We do have the same problem with MediaWiki's MySQL, memcached and search
access, but at least we have straightforward application-level control over
timeouts and retries.
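By application-level control I mean something like the following sketch. It
is written in Python for brevity rather than as MediaWiki code, and every
name in it (call_with_retries, Unavailable, the operation callback) is
hypothetical. The assumption is that the operation accepts a timeout in
seconds and raises TimeoutError or ConnectionError when the backend does not
answer.

```python
import time


class Unavailable(Exception):
    """Raised when the backend did not answer within the allowed attempts."""


def call_with_retries(operation, attempts=2, timeout=0.5,
                      pause=0.1, sleep=time.sleep):
    """Call operation(timeout) up to `attempts` times, then give up.

    The caller chooses the timeout and the retry count, so the worst-case
    stall is bounded by roughly attempts * timeout plus the pauses.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation(timeout)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Pause briefly before the next try, but not after the last one.
            if attempt + 1 < attempts:
                sleep(pause)
    raise Unavailable(f"gave up after {attempts} attempts") from last_error
```

The caller can catch Unavailable and degrade gracefully, for example by
serving the page without the data that backend would have supplied, instead
of hanging the request.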
-- Tim Starling