Hello all,
today I discovered (thanks to the mailing-list) that a few users have run
several bot-instances in parallel on willow. I'm sure that these people did it
by mistake, but it is annoying nevertheless and it is easy to fix: Use SGE.
The problem is that I have already written several emails about "use SGE!"
and somehow it did not work as well as it should (if you have converted your
stuff already: thank you, and you can stop reading here ;-)). I understand
that we are all busy with our lives and Wikipedia and that we all love to "do
it right…later", but as you know, the resources of the toolserver are limited.
So I hereby declare the following new rule:
All bots have to run via SGE. A bot is any program or script that makes
changes to a Wikimedia project. It does not matter whether the bot runs
periodically or continuously. The only exceptions are a) interactive bots, b)
bots that can't run via SGE yet, and c) bots that you start by hand for
testing (no screen, no cron, no while loop).
The rule will become active on Sunday, 10 February 2013. Exception b is
almost NEVER the case: if a bot runs in a shell, it is VERY likely that it can
run via SGE.
Some time ago I wrote a simple SGE how-to at [1]. Maybe you can all take a
look, correct things, and make things clearer. In the vast majority of cases,
using SGE IS easy.
Sincerely,
DaB.
[1] https://wiki.toolserver.org/view/SGE_for_beginners
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello all,
large parts of the toolserver cluster were down or very slow during the last
few hours. As far as I can see, it was a problem with the user-store or
rosemary (to which the user-store is physically connected). I rebooted
rosemary, but the reboot revealed problems with its IPv6 address. I tried to
fix that, which caused several more reboots. Rosemary is now up and running,
but the user-store is not available (it looks like Nosy just mounted it
without updating the fstab file). So I was forced to remove the user-store
everywhere (except on willow, because that needs a reboot and a reboot is
already scheduled for later today).
I will try to find the partition for the user-store and mount it, but I do
not have much hope (there are way too many devices to try). Just to be clear:
no data has been lost. Munin will also be away, because its data is mounted on
that host too. I fear that we have to wait for Nosy to recover before we get
the user-store back.
tl;dr: TS had problems, user-store is away.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
while I was killing some bot processes on willow to reduce the high load, I
accidentally pressed return too early and thereby killed a random number of
processes. I restarted the system processes, but I am not sure whether
everything is completely right. Just to be sure, I hereby announce a reboot
for tomorrow, Monday, 19:05 UTC.
Willow will be away for some minutes. Please note that past experience shows
that cron on Solaris does not start all processes during the reboot, so you
should check after the reboot whether everything works. Please also note that
in a few minutes the new "no bots without SGE" rule ([1]) becomes active, so
please make sure that your bot uses SGE or I might disable it.
I have no idea how many user processes were killed, but I am sorry that it
happened.
Sincerely,
DaB.
[1] http://lists.wikimedia.org/pipermail/toolserver-announce/2013-January/000557.html
--
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Hello all,
for historical reasons, s2 and s5 are together on one host (cassia). Because
cassia is quite overloaded, the sharing will end soon and I will move s2 away.
For this I need your help, because s2 and s5 also share the user-databases and
there is no hint as to which user-database is needed where.
So if you use user-databases for joins with s2 (two!), please add the name of
each user-database to [1] by
Friday, 8 February, 18:00 UTC.
It will take only a few minutes to add your user-databases there, so please do
it. If you do not add your user-databases there, your tools will break after
the split, but of course that can be fixed later.
Sincerely,
DaB.
[1] https://wiki.toolserver.org/view/User:Dab/s2-userdatabaes
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Hello all,
a few days ago my mom got a telephone call from an English-speaking person
babbling something about "servers" (she was not able to understand more and
hung up). I'm not saying that it was one of you, but the incident inspired the
following declaration: *NEVER* call me by phone! Even if you are able to get a
phone number of mine and even if the toolserver is melting: there is never a
reason to call me. If you call me and I am able to identify you, I am going to
delete your account. The only exception is Nosy, who has my cell-phone number.
If you would like to contact me, try IRC. If I am not online, try to contact
another root so he/she can solve the problem. If you can't find another root,
or it has to be me and it is important, write me an email. If I am not able to
check my mail, the chance is VERY high that I cannot ssh to the toolserver
anyway.
Sincerely,
DaB.
--
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885