Brion Vibber wrote:
Browne is mysteriously down for the moment. An initial reboot got it
running again; Tim has some syslog bits he could probably post here
about some sort of problem. Apparently it went down again shortly
thereafter, and my attempt to power cycle it didn't get it back, at
least not back on the network.
I don't have the whole thing, just the few lines I posted to IRC at the
time.
First, the log showed squid operating normally. Then it died, with this
sort of thing being written to the log:
Apr 26 01:43:18 browne kernel: Slab corruption: start=295d1894, len=504
Apr 26 01:43:18 browne kernel: Redzone: 0x5a2cf071/0x5a2cf071.
Apr 26 01:43:18 browne kernel: Last user: [<02185132>](destroy_inode+0x36/0x45)
Apr 26 01:43:18 browne kernel: 030: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b cc 1c 5d 29
Apr 26 01:43:18 browne kernel: Prev obj: start=295d1690, len=504
Shortly afterwards, squid restarted automatically with a different PID.
Kernel messages such as the following were then logged:
Apr 26 01:44:28 browne kernel: slab: Internal list corruption detected in cache 'dentry_cache'(14), slabp 64c7f000(12). Hexdump:
...
Apr 26 01:44:28 browne kernel: invalid operand: 0000 [#1]
The server stayed reachable for some time after the squid restart, maybe
10 minutes. Then it stopped responding to ganglia, ssh and HTTP requests.
It was, however, still pingable. The system log during this time showed
crond causing a kernel error like the one above, once every minute.
There was no other visible activity. This situation continued until the
machine was power-cycled.
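
For anyone who wants to check how often these errors appear in a saved copy of the log, here is a rough Python sketch that counts matching kernel lines per minute. The function name, the regular expression and the list of message fragments are my own assumptions, based only on the excerpts above; other kernels may word these messages differently.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpts above:
#   "Apr 26 01:43:18 browne kernel: <message>"
# Group 1 is the timestamp truncated to the minute, group 3 the message.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d):\d\d\s+(\S+)\s+kernel:\s+(.*)$"
)

# Message fragments seen in the excerpts; this list is an assumption
# and is certainly not exhaustive.
PATTERNS = ("Slab corruption", "Internal list corruption", "invalid operand")


def kernel_errors_per_minute(lines):
    """Count kernel lines containing one of PATTERNS, keyed by minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and any(p in match.group(3) for p in PATTERNS):
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to `kernel_errors_per_minute` and printing the sorted items would show whether the errors really recur once a minute, as the crond messages suggested.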
I don't know enough about the kernel to speculate on what went wrong on
the basis of these logs.
After the restart, browne came back up and worked properly for about an
hour, before dying again as Brion described. It is currently not
responding to ping.
-- Tim Starling