We've been receiving messages from this domain at unblock@toolserver.org,
and they appear to be related to this: viral advertising for some film. In
reality, it's a message with a crapload of images attached that serve no
purpose for us.
Can we just block this whole domain from sending mail to toolserver
accounts? It's a nuisance, and the messages are quite large.
For those of you who haven't seen TS-1553: mail forwarding
seems to have stopped working. So if you haven't received
the usual job reports that you were expecting, you might
want to log in to all servers and check whether there is mail
waiting for you. You can check all servers with:
| for SERVER in clematis hawthorn nightshade ortelius willow wolfsbane yarrow; do
|     ssh $USER@$SERVER.toolserver.org ls -l /var/mail/$USER
| done
replacing $USER with your username.
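A small variation (untested) that only reports the servers that actually have
mail waiting:

| for SERVER in clematis hawthorn nightshade ortelius willow wolfsbane yarrow; do
|     ssh $USER@$SERVER.toolserver.org "test -s /var/mail/$USER && echo mail on $SERVER"
| done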
I'm trying to go through some issues on JIRA, and it keeps logging me out every few minutes.
At some point I even logged in, clicked an issue, and clicked Edit (which uses AJAX), and then the Edit screen wouldn't load because I wasn't authenticated (while I still saw my nickname at the top right).
As some of you might have noticed, s7 is badly corrupted.
Before I go into the details: we will most likely have to set up s7 again due to an InnoDB failure, and I need to find a workaround until we have the data back in place, which can take days or even weeks.
Now the details. Perhaps some of you with database knowledge have an idea, or can comment on my proposed workaround.
The replication failed several times over the past few days, and I did not know why.
I simply skipped the failed slave query each time, and replication ran again.
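(For reference, "skipping the slave query" is the usual skip-counter dance,
roughly this, assuming the usual credentials in ~/.my.cnf:

    mysql -e "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;"
)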
Today the database process restarted repeatedly, without a slave query failing first.
I had a close look and concluded that a broken transaction in the transaction log was making it crash.
So I stopped the transaction from being replayed into the MySQL DB on restart by setting innodb_force_recovery = 3.
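(For anyone following along: that is a startup option in the [mysqld] section
of my.cnf,

    [mysqld]
    innodb_force_recovery = 3

and it takes effect on the next server restart.)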
Ok, fine. MySQL then starts. Cool. But the slave process won't run in this mode, so we don't have the new data. Hm.
So I tried to throw away the broken transaction.
I moved the InnoDB log files (the ib_logfile* files) aside and started MySQL again.
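(From memory, that was something like the following; the datadir path is an
assumption and may differ on our boxes:

    cd /var/lib/mysql
    mv ib_logfile0 ib_logfile0.bak
    mv ib_logfile1 ib_logfile1.bak

InnoDB normally recreates the log files on the next start.)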
MySQL then failed to come up, telling me:
121116 11:40:28 InnoDB: Error: page 7 log sequence number 270 492619208
InnoDB: is in the future! Current system log sequence number 268 2967383564.
InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
Ok. MySQL does not come up that way either; it just restarts repeatedly. No luck. I copied the log files back. Fine, it works again.
Next I tried several things to find out which table might be corrupted.
innochecksum reported everything fine.
mysqlcheck crashed the MySQL daemon when accessing centralauth.localnames.
Oh? Why? Checking the table again crashes MySQL. Hm.
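The invocation was something like this (exact flags from memory):

    mysqlcheck -c centralauth localnames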
Tried a REPAIR TABLE - "storage engine does not support this"... hm.
The log says
InnoDB: Page lsn 268 3672100478, low 4 bytes of lsn at page end 3672100478
InnoDB: Page number (if stored to page already) 192520,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 428
InnoDB: Page may be an index page where index id is 0 1174
InnoDB: (index "PRIMARY" of table "centralauth"."localnames")
InnoDB: Error in page 192520 of index "PRIMARY" of table "centralauth"."localnames"
121116 13:02:46 - mysqld got signal 11 ;
I tried to drop the index in order to rebuild it, but that does not work with innodb_force_recovery = 3.
mysqldump fails as well - it also crashes the MySQL daemon.
So I have no more ideas for fixing this error.
Now my thought: since we have to set s7 up again anyway, I could drop the table completely and start MySQL in normal mode, so that replication works again.
This would only mean that s7 lacks this table until it is re-imported.
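If we go this route, a replication filter in my.cnf should keep the slave from
stopping on statements that still touch the dropped table. A sketch, not
tested here:

    [mysqld]
    replicate-ignore-table = centralauth.localnames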
What do you think about this?
Any more ideas?
I just got back from the general member meeting of Wikimedia Deutschland. As
you know, I requested a decision about the future of the toolserver there. To
make it short: it didn't go as well as I hoped. While the request itself
was accepted, it was changed in some important parts.
The main fear was that the WMF could stop providing us with fresh dumps and/or
replication in the near future, making the toolserver more or less useless.
I did learn from a participating WMF board member, though, that no such board
decision exists.
My request was changed in the following way: the WMF has to tell WMDE within 6
months how Wikilabs can replace the toolserver in the promised, complete way.
If the answer is not satisfying, WMDE will develop a "Governance-Model" to
ensure the continuation of the toolserver. Different groups are invited into
this "Governance-Model", and it should be finished by the end of 2013.
That sounds good at first glance, but there are two loopholes: nobody has
defined what "complete" or "satisfying" means. In my eyes Wikilabs cannot
replace the toolserver completely (in the sense that all tools could move
there), and so the answer can only be unsatisfying - but that's just a
question of definition, I fear.
A second change was that investment in the toolserver will be restricted
to the "necessary". While that is of course a matter of definition again, I'm
sure it means "no new hardware if it can be avoided in any way".
To summarize this: in the best case we have to wait 6 months until WMDE
officially learns that Wikilabs cannot replace us, then wait another 6 months
until they create their "Governance-Model", and in 2014 we get new hardware.
In the worst case we wait 6 months, then WMDE and the WMF agree that
everything is OK, we never get any new hardware, and at some point the TS will
shut down (taking with it, of course, the remaining tools that cannot migrate
to Wikilabs).
I cannot really imagine the outcomes between those two cases, but I'm sure
they exist. Either way we will get no (or nearly no) new hardware in 2013 – so
we have to live with what we have.
One piece of good news: the toolserver will get 3 new database servers soon.
I have not decided yet whether I will remain a root under these circumstances
for 2013 – I will tell you my decision by next Sunday.
For now I will head to bed, because I'm exhausted and disappointed. See you.
I'm currently running a test of one of SuggestBot's scripts and noticed
that sometimes there have been two jobs running almost in parallel, which
surprised me since I only have a single cron job. Maybe there's an error
in my setup that's causing this, or maybe it's just a glitch in the Matrix?
Since I have no idea whether it's the former, I'd be happy if someone had
an idea of what's been causing this.
Here's the crontab entry from the submit server:
28 * * * * cronsub nettasks $HOME/SuggestBot/opentask/opentasks-nettrom.sh
So far there have been eight incidents of duplicate jobs:
jobnumber 837874, qsub_time Tue Nov 27 21:28:02 2012
jobnumber 837876, qsub_time Tue Nov 27 21:28:03 2012
jobnumber 841731, qsub_time Wed Nov 28 13:28:01 2012
jobnumber 841734, qsub_time Wed Nov 28 13:28:03 2012
jobnumber 844796, qsub_time Thu Nov 29 01:28:01 2012
jobnumber 844797, qsub_time Thu Nov 29 01:28:02 2012
jobnumber 845829, qsub_time Thu Nov 29 05:28:01 2012
jobnumber 845830, qsub_time Thu Nov 29 05:28:01 2012
jobnumber 846093, qsub_time Thu Nov 29 06:28:01 2012
jobnumber 846095, qsub_time Thu Nov 29 06:28:03 2012
jobnumber 846351, qsub_time Thu Nov 29 07:28:01 2012
jobnumber 846354, qsub_time Thu Nov 29 07:28:03 2012
jobnumber 847126, qsub_time Thu Nov 29 10:28:02 2012
jobnumber 847128, qsub_time Thu Nov 29 10:28:03 2012
jobnumber 848150, qsub_time Thu Nov 29 14:28:01 2012
jobnumber 848151, qsub_time Thu Nov 29 14:28:02 2012
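In case it helps with debugging: the accounting record for each job number can
be pulled with qacct, e.g.

    qacct -j 837874 | egrep 'jobnumber|hostname|qsub_time'

which should show, among other things, which host each of the duplicates ran
on.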
Here's the shell script that's launched by cron:
nettrom@willow:~$ less SuggestBot/opentask/opentasks-nettrom.sh
# Name the job "nettasks".
#$ -N nettasks
# Tell the server we'll be running for a maximum of 55 minutes.
#$ -l h_rt=00:55:00
# Join STDERR and STDOUT
#$ -j y
# Store output in a different place.
#$ -o $HOME/SuggestBot/logs/opentasks-nettrom.log
# Ask for 256MB of memory
#$ -l virtual_free=256M
# Need 1 SQL process on s1-rr for enwiki, and 1 SQL process on sql-user-n
#$ -l sql-s1-rr=1
#$ -l sql-user-n=1
# Until oursql is available on the Linux hosts, we have to restrict this to
# the Solaris hosts, or rewrite it to use MySQLdb.
#$ -l arch=sol
# Engage virtualenv
# Make sure my local modules work
# Post to my userspace with 5x oversampling, pointing to the right
# classifier host file
python $HOME/SuggestBot/opentask/opentasks.py -o 5 \
    --page="User:Nettrom/sandbox/opentask" -l en \
I sent my announcement email (see below) to the wrong mail address yesterday.
So here's a postmortem: the dumping and importing worked fine, and replication
was stopped for ~40 min; the replag was cleared a short time later. I also
imported wikidata on thyme and started replication there too.
On Friday, 30 November 2012, 13:03:12, you wrote:
> Hello all,
> cassia's copy of wikidata is broken. Because wikidata is still quite small,
> and getting a dump from the WMF usually takes some time, I will create the
> dump myself from daphne (sql-s4) this time. For this I will stop the
> wikidata replication there today at 19:20 UTC.
> All other replication will continue to run. The impact on daphne's
> performance should be small, and you should notice no problems.
The HA node turnera needs a reboot quite badly (it is very low on memory). It
should be no problem to reboot it (because the other node, damiana, should
take over everything), but one can never be too sure. I therefore hereby
announce a downtime window of 15 minutes for tomorrow,
Tuesday the 27th, at 21:15 UTC.
Normally you should notice nothing, but in the worst case the toolserver will
be down/unavailable until turnera has rebooted completely.
When the reboot of turnera is done, I will also reboot nightshade, which has
had a very high load since we moved the stuff from thyme to rosemary. This
also should not take longer than 15 minutes. Please make sure you have no
open files on this box around the time of the reboot.