db22 failure causing CentralAuth problems - Wikitech-l

6 Jan 2012

Hi,

one of our database servers, db22, had a disk failure a little while
ago, and while the failed disk was waiting to be replaced, another RAID
problem appeared.
This caused downtime on db22, and users started reporting problems at
around 7 pm:

19:09 < malafaya> so, what's wrong?
This was even before the Nagios alert:

19:19 <+nagios-wm> PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet
loss = 100%

Since this affected CentralAuth, users kept getting error messages
like: [db22: s4] 10.0.6.32

Database ops immediately started promoting a database slave to be the
new master, while the hardware issue on db22 is still being investigated.

The current effect is that Commons is read-only. The expected downtime
was estimated at 10 minutes at the time of writing.

-- 
Daniel Zahn <dzahn(a)wikimedia.org>