Wow, I'm just the bearer of good news today!
The second hard drive on pliny, the one that holds most of the database, has stopped responding.
Log extract attached.
-- brion vibber (brion @ pobox.com)
On Thu, 2003-10-02 at 12:06, Brion Vibber wrote:
Wow, I'm just the bearer of good news today!
The second hard drive on pliny, the one that holds most of the database, has stopped responding.
Jason got it hard-rebooted (it wouldn't reboot nicely on its own) and it seems to have come back okay.
www.wikipedia.org is up and running.
I'm having some trouble getting apache on pliny to talk to mysql, however, so all other sites are temporarily unable to access the database. Sigh.
(Oh, and larousse is back up and working.)
-- brion vibber (brion @ pobox.com)
On Thu, 2003-10-02 at 14:21, Brion Vibber wrote:
I'm having some trouble getting apache on pliny to talk to mysql, however, so all other sites are temporarily unable to access the database. Sigh.
Got that sorted out. Had forced a failure in the db connect code earlier in the day to get the cache fallback working right, so then it wouldn't reconnect. :P
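For illustration only, here is a minimal Python sketch of the general pattern described above: try the database, fall back to a cache, and make sure a recorded failure expires so the code reconnects later. It is not the actual site code. `PageSource`, `DatabaseUnavailable`, the `load()` method on the connection, and the 30-second cooldown are all hypothetical names and values chosen for the example.

```python
import time


class DatabaseUnavailable(Exception):
    """Raised when no database connection can be obtained right now."""


class PageSource:
    """Hypothetical page loader: tries the database, falls back to a cache."""

    def __init__(self, connect, cache, retry_after=30.0, clock=time.monotonic):
        self._connect = connect          # callable returning a connection, or raising
        self._cache = cache              # dict-like: title -> cached text
        self._retry_after = retry_after  # seconds to wait before retrying a failed connect
        self._clock = clock
        self._conn = None
        self._failed_at = None           # time of the last failed connect, if any

    def _get_connection(self):
        if self._conn is not None:
            return self._conn
        now = self._clock()
        # The failure marker expires; a marker that never clears is the
        # "it wouldn't reconnect" situation.
        if self._failed_at is not None and now - self._failed_at < self._retry_after:
            raise DatabaseUnavailable("still in cooldown after a failed connect")
        try:
            self._conn = self._connect()
        except Exception as exc:
            self._failed_at = now
            raise DatabaseUnavailable(str(exc)) from exc
        self._failed_at = None
        return self._conn

    def fetch(self, title):
        try:
            # load() is an assumed method on the hypothetical connection object.
            return self._get_connection().load(title)
        except DatabaseUnavailable:
            if title in self._cache:
                return self._cache[title]
            raise
```

The sketch only covers failures at connect time; a connection that dies later would need its own handling.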
Anyway, we've suspected disk or SCSI controller problems on pliny before in relation to its crashes. If the main disk went out in the same way that the second disk did today, it would produce the customary symptoms: the kernel and some processes stay running (so the machine responds to ping and gets partway through an ssh connect), nothing shows up in the logs (because it can't write to them), and then after a reboot it's as though nothing happened.
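Since ping only shows that the kernel is alive, a check that actually writes to the disk would catch this failure mode. The following is a rough Python sketch of such a probe, not something that runs on pliny; `disk_is_writable` and the 5-second timeout are assumptions made up for the example. The write runs in a background thread so that a hung disk blocks only the probe and not the caller.

```python
import os
import tempfile
import threading


def disk_is_writable(directory, timeout=5.0):
    """Return True only if a small write plus fsync in `directory` finishes in time."""
    result = {"ok": False}

    def probe():
        try:
            fd, path = tempfile.mkstemp(dir=directory, prefix=".probe-")
            try:
                os.write(fd, b"ping")
                os.fsync(fd)      # force the data out to the device
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)          # a hung disk leaves the worker blocked
    return result["ok"] and not worker.is_alive()
```

A monitor on another machine would still need some way to call this remotely, and the probe itself may fail to start if the interpreter has to be read from the disk that is hanging.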
BLEEEEEEH!
Thanks to Jason for getting the hard reboot done, and for getting larousse back online as well.
-- brion vibber (brion @ pobox.com)
On Thu, 2003-10-02 at 14:36, Brion Vibber wrote:
Anyway, we've suspected disk or SCSI controller problems on pliny before in relation to its crashes. If the main disk went out in the same way that the second disk did today, it would produce the customary symptoms:
FWIW both drives are the same model, and are on the same controller. Could be a drive problem, could be a controller problem (though I'd then expect both drives to go at once). Could be anything. The log extract I posted earlier could be useful to someone who understands these things.
-- brion vibber (brion @ pobox.com)
Well, I hope we make it through the weekend, and then I hope the new hardware fixes things. The new motherboards have their own SCSI controllers, so if that's the problem, that'll fix it. If it's the drives themselves, well, we can buy new drives anytime, that's not really an obstacle.
Brion Vibber wrote:
On Thu, 2003-10-02 at 14:36, Brion Vibber wrote:
Anyway, we've suspected disk or SCSI controller problems on pliny before in relation to its crashes. If the main disk went out in the same way that the second disk did today, it would produce the customary symptoms:
FWIW both drives are the same model, and are on the same controller. Could be a drive problem, could be a controller problem (though I'd then expect both drives to go at once). Could be anything. The log extract I posted earlier could be useful to someone who understands these things.
-- brion vibber (brion @ pobox.com)
_______________________________________________
Wikitech-l mailing list
Wikitech-l@Wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Brion Vibber wrote:
Wow, I'm just the bearer of good news today!
The second hard drive on pliny, the one that holds most of the database, has stopped responding.
Log extract attached.
-- brion vibber (brion @ pobox.com)
The only thing I can see is that a command timed out on the controller, which was probably still responding since all the ABORT commands were successful. I suppose the problem could come from the controller or from the disks.
I can't say more without first studying the SCSI protocol, and then the code of the SCSI layer and of the controller driver.
-- Looxix
On Fri, 03 Oct 2003 00:42:50 +0200, Luc Van Oostenryck luc.vanoostenryck@easynet.be wrote:
The only thing I can see is that a command timed out on the controller, which was probably still responding since all the ABORT commands were successful. I suppose the problem could come from the controller or from the disks.
Cable?
wikitech-l@lists.wikimedia.org