Hello all,
just a little story of what happened today: As you know, I planned to dump the user databases of rosemary today to import them on thyme later. Around 12 o'clock CET I looked at the replag of thyme during a break and everything was fine. After my dinner I checked my mail and found an email from the OSM guys complaining that their title directory was gone.

Some background: thyme carries the NFS server for the user-store, title and munin. These normally run on hemlock, but because hemlock's SAN card is broken we had to move them to another server.

A short time later I spoke with Nosy on IRC about thyme. She told me that thyme was inaccessible via SSH. A few days ago we had discovered that thyme's serial console was not working (we have put that on the datacenter to-do list); but without SSH and without a serial console you can neither access a server nor reboot it. Nosy had started to move the NFS server from thyme to rosemary, and we completed that together.

Because of the missing user-store, the script that checks your quota at login failed, and logging in to the Linux servers was hardly possible. I deleted the script on those boxes and added a quick-and-dirty fix to puppet. That fix failed later, making login to the Linux boxes impossible for some time (even for roots). Switching the user-store from thyme to rosemary caused some problems on the userland servers (because the user-store was busy), but I think we fixed those. Maybe we will have to reboot some boxes in the next days; I will send a mail if needed.

Thyme also carried my wikidata replication program, which failed too (so the replag of wikidata increased everywhere). I have moved it to another server now. A strange thing is that the mysql process on thyme is still running; even replication is working, so the replag will not increase there.

The next step is to reach Mark or someone from the datacenter to reboot thyme and then look into what the problem was. Munin shows nothing abnormal.
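For anyone curious what "looking at the replag" means in practice: it boils down to reading the Seconds_Behind_Master field that MySQL reports in the output of SHOW SLAVE STATUS. The following is only an illustrative sketch of that check (the parsing helper and the sample output are made up for this example, not the actual Toolserver monitoring code):

```python
# Illustrative sketch: extract the replication lag ("replag") from the
# text output of MySQL's `SHOW SLAVE STATUS\G`. Not the real monitoring
# script -- just a demonstration of the idea.

def parse_slave_status(output):
    """Parse `SHOW SLAVE STATUS\\G`-style `Key: Value` lines into a dict."""
    fields = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

def replag_seconds(output):
    """Return the replication lag in seconds, or None if it is unknown
    (MySQL reports NULL when replication is not running)."""
    value = parse_slave_status(output).get("Seconds_Behind_Master")
    if value is None or value == "NULL":
        return None
    return int(value)

if __name__ == "__main__":
    sample = """\
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
    Seconds_Behind_Master: 42
"""
    print(replag_seconds(sample))  # -> 42
```

A lag of a few seconds is normal; a steadily growing value (as happened when the wikidata replication program died) is what shows up as increasing replag.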
Just to let you know. Good night.
Sincerely, DaB.
On 13/11/12 01:46, DaB. wrote:
> Thyme also carried my wikidata replication program, which failed too (so the replag of wikidata increased everywhere). I have moved it to another server now. A strange thing is that the mysql process on thyme is still running; even replication is working, so the replag will not increase there.
What about puppetd? If it's still running, it could provide a way to restart the server.
It looks as if sshd had simply died. I would have blamed the OOM killer, but thyme is Solaris, and it affected both sshd and the nfsd at the same time (but not mysqld). Maybe an error with the filesystem unmounting itself, but in that case sshd should still work.
Hello,

On Tuesday, 13 November 2012, 14:04:40, DaB. wrote:
> What about puppetd? If it's still running, it could provide a way to restart the server.
It stopped working together with the other services. We had that idea too :-).
Neither munin nor nagios showed problems with the memory on thyme before the crash, and normally the Solaris SVC system would restart SSH even if it was killed. The only strange thing I see in nagios is that thyme lost both SAN connections during the first phase of the crash, regained them, recovered for 2 minutes, crashed again, but kept the SAN connections this time.
Because thyme will be hard-restarted (power down/up), we will lose the "state" of the server, but maybe the syslog can tell us a little bit.
Sincerely, DaB.
toolserver-l@lists.wikimedia.org