Same old story: a full disk on a core master server (ixia) caused binlogs to stop 10 minutes before the issue was noticed and I switched it into read-only mode. Writes continued during those 10 minutes.
I'm resyncing from the master, and the s2 wikis are in read-only mode while that happens. It seems to be taking about 1.5 hours in total.
The server was in Nagios and was reporting a critical disk full status. I'm not sure exactly when it entered that state.
I'm inclined to think that the issue here is not the need for more technology, but rather the need for procedures. There's no point in having monitoring if nobody is watching the output.
If it had happened an hour later, I would have been in bed, and nobody else was around. The users in #wikimedia-tech tell me they would have waited for hours before trying to phone anyone. So we need out-of-hours response procedures as well.
I think we need:
* A systems checklist to be checked daily, independently by two different people, and cross-checked weekly;
* An SMS paging system for out-of-hours response, both automated and manual (user-driven).
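
For illustration only, the automated half of such a paging system could be as small as the sketch below. Everything in it is an assumption rather than anything deployed: the function names, the 95% threshold, and the `page` callback, which stands in for whatever SMS gateway ends up being used. It does not replace the procedures above; somebody still has to be on the receiving end of the page.

```python
import shutil


def disk_used_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the filesystem at `path` in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(path="/", threshold=0.95, page=print):
    """Call page(message) if usage is at or above `threshold`.

    `page` is a hypothetical hook: in practice it would hand the message
    to an SMS gateway; here it defaults to print so the sketch runs as-is.
    Returns True if a page was sent, False otherwise.
    """
    fraction = disk_used_fraction(path)
    if fraction >= threshold:
        page(f"CRITICAL: {path} is {fraction:.0%} full")
        return True
    return False
```

Run from cron every few minutes, a check like this would page someone when the disk nears full instead of relying on anyone happening to look at the monitoring output.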
-- Tim Starling