Hi,
one of our database servers, db22, had a disk failure a little while
ago, and while the failed disk was waiting to be replaced, another RAID
problem appeared.
This caused downtime on db22, and users started reporting problems at
around 7 pm:
19:09 < malafaya> so, what's wrong?
Nagios flagged the host as well:
19:19 <+nagios-wm> PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet loss = 100%
Since this affected CentralAuth, users kept getting error messages
like: [db22: s4] 10.0.6.32
Database ops immediately started promoting a database slave to be the new
master, while the hardware issue on db22 is still being investigated.
The current effect is that Commons is read-only. The expected downtime
was about 10 minutes at the time of writing.
--
Daniel Zahn <dzahn(a)wikimedia.org>
By now everything _should_ be back to normal. Thanks for your patience.
This is what happened on the technical side:
18:37 maplebed: pushed out new db.php setting s4 to read-write
18:37 logmsgbot: ben synchronized wmf-config/db.php
18:35 maplebed: db31 made read-write as the new master for s4
18:31 maplebed: old master for s4 log file db22-bin.000106 log pos 631618956
18:30 maplebed: new master for s4: db31, log file db31-bin.000213 log pos is 205612709
18:24 logmsgbot: asher synchronized wmf-config/db.php 'setting s4 to read only, preparing to make db31 master'
18:21 Reedy: Commons having db issues, db22 (s4 master) has a disk issue
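
The log above is listed newest first, so the failover reads bottom to top: the disk issue is noticed, s4 is set to read-only, the binlog positions are recorded, db31 is made the new master, and s4 goes back to read-write. As a rough sketch for anyone who wants the entries in chronological order, the Python below parses them and sorts them oldest first. The names `LOG`, `ENTRY` and `parse_log` are made up for this illustration, and it assumes every entry has the form "HH:MM nick: message" on a single line.

```python
import re

# Log entries copied from this message (newest first, wrapped lines joined).
LOG = """\
18:37 maplebed: pushed out new db.php setting s4 to read-write
18:37 logmsgbot: ben synchronized wmf-config/db.php
18:35 maplebed: db31 made read-write as the new master for s4
18:31 maplebed: old master for s4 log file db22-bin.000106 log pos 631618956
18:30 maplebed: new master for s4: db31, log file db31-bin.000213 log pos is 205612709
18:24 logmsgbot: asher synchronized wmf-config/db.php 'setting s4 to read only, preparing to make db31 master'
18:21 Reedy: Commons having db issues, db22 (s4 master) has a disk issue
"""

# Assumed entry shape: "HH:MM nick: message" (hypothetical pattern for this sketch).
ENTRY = re.compile(r"^(\d{2}):(\d{2}) ([^:]+): (.*)$")


def parse_log(text):
    """Return (minutes_since_midnight, nick, message) tuples, oldest first."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        hours, minutes, nick, message = match.groups()
        entries.append((int(hours) * 60 + int(minutes), nick, message))
    # Input is newest first: reverse it, then stable-sort by time so that
    # entries sharing a minute keep their reversed (older-first) order.
    entries.reverse()
    entries.sort(key=lambda entry: entry[0])
    return entries


if __name__ == "__main__":
    for minutes, nick, message in parse_log(LOG):
        print(f"{minutes // 60:02d}:{minutes % 60:02d} {nick}: {message}")
```

Going by the timestamps above, s4 was read-only from about 18:24 until 18:37.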
--
Daniel Zahn <dzahn(a)wikimedia.org>