Hey,
I would like to reboot ortelius, one of the web servers, tomorrow (Tuesday) at 1830 UTC.
Cheers,
Marlen/nosy
Marlen Caemmerer marlen.caemmerer@wikimedia.de wrote:
I would like to reboot ortelius, one of the web servers, tomorrow (Tuesday) at 1830 UTC.
Apparently, wolfsbane rebooted today as well:
| timl@wolfsbane:~$ uptime
| 16:49pm up 5:00, 2 users, load average: 1.16, 1.24, 1.47
| timl@wolfsbane:~$
Perhaps related to that, SGE queues on ortelius and wolfsbane are in state "au" (alarm, unknown):
| timl@wolfsbane:~$ qstat -f -explain a | sed -ne '1,2p' -e '/ortelius|wolfsbane/,/^-/p'
| queuename                      qtype resv/used/tot. load_avg arch       states
| ---------------------------------------------------------------------------------
| short-sol@ortelius.toolserver. B     0/0/8          -NA-     sol-amd64  au
| error: no value for "np_load_short" because execd is in unknown state
| error: no value for "np_load_avg" because execd is in unknown state
| error: no value for "cpu" because execd is in unknown state
| error: no value for "mem_free" because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=200M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| short-sol@wolfsbane.toolserver B     0/10/12        -NA-     sol-amd64  au
| error: no value for "np_load_short" because execd is in unknown state
| error: no value for "np_load_avg" because execd is in unknown state
| error: no value for "cpu" because execd is in unknown state
| error: no value for "mem_free" because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=200M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| medium-sol@ortelius.toolserver B     0/0/4          -NA-     sol-amd64  au
| error: no value for "np_load_short" because execd is in unknown state
| error: no value for "np_load_avg" because execd is in unknown state
| error: no value for "np_load_long" because execd is in unknown state
| error: no value for "cpu" because execd is in unknown state
| error: no value for "mem_free" because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=100M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| medium-sol@wolfsbane.toolserve B     0/3/4          -NA-     sol-amd64  au
| error: no value for "np_load_short" because execd is in unknown state
| error: no value for "np_load_avg" because execd is in unknown state
| error: no value for "np_load_long" because execd is in unknown state
| error: no value for "cpu" because execd is in unknown state
| error: no value for "mem_free" because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=100M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| timl@wolfsbane:~$
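In case it saves someone a step: "au" here should just mean that the qmaster has lost contact with the execd on those hosts. A quick way to confirm that (a sketch with a generic prompt, not a transcript from the hosts):

| $ qhost                # hosts whose execd is unreachable show "-" in the load/memory columns
| $ pgrep -l sge_execd   # run on ortelius/wolfsbane themselves; no output means execd is not running

Starting sge_execd again on the affected hosts (needs root) should then clear the "u" part of the state.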
Tim
On 09/05/13 18:57, Tim Landscheidt wrote:
Marlen Caemmerer marlen.caemmerer@wikimedia.de wrote:
I would like to reboot ortelius, one of the web servers, tomorrow (Tuesday) at 1830 UTC.
Apparently, wolfsbane rebooted today as well:
| timl@wolfsbane:~$ uptime
| 16:49pm up 5:00, 2 users, load average: 1.16, 1.24, 1.47
| timl@wolfsbane:~$
Perhaps related to that, SGE queues on ortelius and wolfsbane are in state "au" (alarm, unknown):
Yes, sge_execd seems not to be running on them.
Also, the medium and longrun queues on yarrow are in error state. I tried clearing them, but they failed again.
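The reason they keep failing should be visible via qstat's explain switch (a sketch with a generic prompt, not an actual transcript):

| $ qstat -f -explain E   # prints the error reason under each queue instance that is in state "E"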
(anonymous) wrote:
[...]
Also, the medium and longrun queues on yarrow are in error state. I tried clearing them, but they failed again.
I think I found the culprit:
| timl@yarrow:~$ df -i /var/spool/cron/atjobs
| Filesystem            Inodes  IUsed  IFree IUse% Mounted on
| /dev/mapper/yarrow0-var
|                       915712 915712      0  100% /var
| timl@yarrow:~$
With my privileges, I can't find out what's causing this. If I could, the first places I would look are /var/log/iptraf and /var/spool/postfix/*.
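If someone with root wants a starting point, counting directory entries under /var usually points at the culprit quickly (a sketch; this assumes GNU find is available on yarrow):

| $ find /var -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head

That counts entries per directory on the /var filesystem only (-xdev); the directories at the top of the list are the likely inode hogs.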
After fixing this, we need Nagios alerts for /var as well.
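The standard check_disk plugin can already do that: besides the free-space thresholds it takes separate free-inode thresholds (a sketch, thresholds picked arbitrarily):

| $ check_disk -w 10% -c 5% -W 10% -K 5% -p /var

-w/-c are the free-space warning/critical thresholds, -W/-K the corresponding free-inode ones.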
Tim
P. S.: Toolserver Office Hour + 10 days = today.
I wrote:
[...]
Also, the medium and longrun queues on yarrow are in error state. I tried clearing them, but they failed again.
I think I found the culprit:
| timl@yarrow:~$ df -i /var/spool/cron/atjobs
| Filesystem            Inodes  IUsed  IFree IUse% Mounted on
| /dev/mapper/yarrow0-var
|                       915712 915712      0  100% /var
| timl@yarrow:~$
With my privileges, I can't find out what's causing this. If I could, the first places I would look are /var/log/iptraf and /var/spool/postfix/*.
[...]
Merlissimo had filed https://jira.toolserver.org/browse/TS-1649 earlier. Now someone or something has freed about 96 % of inodes on /var, but left no note, so I don't know if the queues on yarrow can be enabled again.
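If whoever freed the space is confident it will stay freed, re-enabling should just be a matter of confirming the inode headroom and clearing the error state (a sketch; the queue instance names here are my guess):

| $ df -i /var                # confirm IUse% has actually dropped and stays down
| $ qmod -c 'medium@yarrow'   # clear the E state (needs operator/manager rights); likewise for longrun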
Tim