Same old story: a full disk on a core master server (ixia) caused binlogs to
stop 10 minutes before the issue was noticed and I switched it into
read-only mode. Writes continued during those 10 minutes.
I'm resyncing from the master. The s2 wikis are in read-only mode while
that happens; it seems to be taking about 1.5 hours in total.
The server was in nagios and was reporting a critical disk full status.
I'm not sure exactly when it entered that state.
I'm inclined to think that the issue here is not the need for more
technology, but rather the need for procedures. There's no point in having
monitoring if nobody is watching the output.
If it had happened an hour later, I would have been in bed, and nobody
else was around. The users in #wikimedia-tech tell me they would have
waited for hours before trying to phone anyone. So we need out-of-hours
response procedures as well.
I think we need:
* A systems checklist to be checked daily, independently by two different
people and cross-checked weekly;
* An SMS paging system for out-of-hours response, both automated and
manual (user-driven).
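
As a rough illustration of the automated half of the paging idea, here is a
minimal sketch in Python. The thresholds, the grace period and the send_page
function are all assumptions made up for the example; nothing here describes
the existing nagios configuration or any particular SMS gateway.

```python
import shutil

# Hypothetical thresholds, expressed as fraction of disk space used.
WARN_FRACTION = 0.85
CRIT_FRACTION = 0.95


def classify(used_fraction, warn=WARN_FRACTION, crit=CRIT_FRACTION):
    """Map a used-space fraction to a Nagios-style status string."""
    if used_fraction >= crit:
        return "CRITICAL"
    if used_fraction >= warn:
        return "WARNING"
    return "OK"


def disk_status(path="/"):
    """Classify the filesystem containing `path` by how full it is."""
    usage = shutil.disk_usage(path)
    return classify(usage.used / usage.total)


def should_page(status, minutes_unacknowledged, grace_minutes=5):
    """Page only for CRITICAL alerts that nobody has acknowledged
    within the grace period (an assumed policy, not an existing one)."""
    return status == "CRITICAL" and minutes_unacknowledged >= grace_minutes


def send_page(message):
    # Placeholder: a real deployment would call an SMS gateway here.
    print("PAGE:", message)


if __name__ == "__main__":
    status = disk_status("/")
    # Assume the alert has been sitting unacknowledged for 10 minutes.
    if should_page(status, minutes_unacknowledged=10):
        send_page(f"disk {status} on /")
```

The point of the grace period is that a page goes out only when the normal
monitoring output has gone unwatched, which is the failure we had here.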
-- Tim Starling