Toolserver-l November 2012

toolserver-l@lists.wikimedia.org

29 participants
23 discussions

Jira Auth broken
by Marlen Caemmerer 20 Mar '13

20 Mar '13

Hello, we have an issue with Jira authentication since 25th August. It seems the syncronisation with the crowd server is broken but I dont know why and filed a bug at Atlassian. Cheers Marlen

4 6

Block mail from dysgo.org?
by Christopher David Howie 17 Feb '13

17 Feb '13

We've been receiving messages from this domain at unblock(a)toolserver.org and they appear to be related to this: <http://www.reddit.com/r/WTF/comments/l44r1/i_just_got_this_email_at_work_i_…>. Viral advertising for some film. In reality, it's a message with a crapload of images attached serving no purpose for us. Can we just block this whole domain from sending mail to toolserver accounts? It's a nuisance, and the messages are quite large. -- Chris Howie http://www.chrishowie.com http://en.wikipedia.org/wiki/User:Crazycomputers If you correspond with me on a regular basis, please read this document: http://www.chrishowie.com/email-preferences/ PGP fingerprint: 2B7A B280 8B12 21CC 260A DF65 6FCE 505A CF83 38F5 ------------------------------------------------------------------------ IMPORTANT INFORMATION/DISCLAIMER This document should be read only by those persons to whom it is addressed. If you have received this message it was obviously addressed to you and therefore you can read it. Additionally, by sending an email to ANY of my addresses or to ANY mailing lists to which I am subscribed, whether intentionally or accidentally, you are agreeing that I am "the intended recipient," and that I may do whatever I wish with the contents of any message received from you, unless a pre-existing agreement prohibits me from so doing. This overrides any disclaimer or statement of confidentiality that may be included on your message.

3 3

Mail forwarding not working
by Tim Landscheidt 04 Feb '13

04 Feb '13

Hi, for those of you not having seen TS-1553, mail forwarding seems to have stopped working. So if you haven't received the usual job reports that you were expecting, you might want to login to all servers and check if there is mail for you. You can query all servers by: | for SERVER in clematis hawthorn nightshade ortelius willow wolfsbane yarrow; do | ssh $USER(a)$SERVER.toolserver.org ls -l /var/mail/$USER | done replacing $USER with your username. Tim

5 10

JIRA session loss
by Krinkle 17 Dec '12

17 Dec '12

Hi, I'm trying go through some issues on JIRA and it keeps logging me out every few minutes. At some point I even logged in, clicked an issue, clicked Edit (which uses AJAX) and then the Edit screen wouldn't load due to me not being authenticated (while I still saw my nickname on the top right). -- Krinkle

9 12

Defective database s7
by Marlen Caemmerer 06 Dec '12

06 Dec '12

Hello, as some of you might have noticed s7 is badly corrupted. Before I go into the details I can say we will most likely have to resetup s7 due to a innodb failure and I need to find a workaround until we have the data in place which can take days or even weeks. Now the details - probably some of you with database knowledge might have an idea or say something to my idea of a workaround. The replication failed several times in the past days and I did not know why. I simply skipped the slave query when it failed and then replication ran again. Today the database process restarted without a slave query failing repeatedly. I had a close look and came to the idea that a broken transaction in the transaction log made it break. So I stopped the transaction from being played into the mysql db when the database restarts by setting innodb_force_recovery = 3. Ok, fine. MySQL then starts. Cool. But the slave process wont run in this mode so we dont have the new data. Hm. So I tried to throw away the broken transaction. I moved the iblog-files and started mysql again. MySQL failed to come up then telling me: 121116 11:40:28 InnoDB: Error: page 7 log sequence number 270 492619208 InnoDB: is in the future! Current system log sequence number 268 2967383564. InnoDB: Your database may be corrupt or you may have copied the InnoDB InnoDB: tablespace but not the InnoDB log files. See InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-innodb-recovery.html Ok. MySQL does not come up then and repeatedly restarts. No luck. Copied the log files back. Fine. Works again. Now I tried several thing to check which table might be corrupted. Innodbchecksum reported everything fine. Mysqlcheck crashed the mysql daemon when accessing centralauth.localnames. Oh? Why? Checking the table again crashes mysql. Hm. Tried a repair table - "storage engine does not support this"...hm. The log says InnoDB: Page lsn 268 3672100478, low 4 bytes of lsn at page end 3672100478 InnoDB: Page number (if stored to page already) 192520, InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 428 InnoDB: Page may be an index page where index id is 0 1174 InnoDB: (index "PRIMARY" of table "centralauth"."localnames") InnoDB: Error in page 192520 of index "PRIMARY" of table "centralauth"."localnames" 121116 13:02:46 - mysqld got signal 11 ; I tried to remove the index to rebuild it but this does not work due to innodb_force_recovery = 3. Mysqldump fails - crashes the mysql daemon too. So I dont have any more idea how to fix this error. Now I thought if we have to resetup I could drop the table completely and start mysql normal mode so replication works again. This would only mean s7 would lack this table until it is resetup. What do you think about this? Any more ideas? Cheers Marlen/nosy

4 7

Result of the general member meeting of WMDE
by DaB. 04 Dec '12

04 Dec '12

Hello all, I just got back from the general member meeting of Wikimedia Deutschland. As you know I requested a decision about the future of the toolserver there. To make it short: It doesn't went as well as I hoped. While the request itself was accepted, it was changed in some important parts. The main fear was that WMF could stop to provide us with fresh dumps and/or replication in near future, making the toolserver more or less useless. Although I learned from a participating WMF-board-member that no such board- decision exists. My request was changed in the following way: The WMF has to tell WMDE within 6 months how Wikilabs can replace the toolserver in the promised complete way. If the answer is not satisfying, WMDE will develop a "Governance-Model" to ensure the continuation of the toolserver. Different groups are invited into this "Governance-Model" and it should be done until the end of 2013. That sounds good on the first view, but there are 2 loop-holes: Nobody defined what "complete" or "satisfying" is. In my eyes Wikilabs can not replace the toolserver complete (in the way that all tools can move to there) and so the answer can only be unsatisfying, but that's just a question of definition I guess. A second change was that the investment for the toolserver will be restricted to the "necessary". While that is of course a matter of definition again I'm sure that means "no new hardware if it is possible in any way". To summarize this: In the best case we have to wait for 6 months until WMDE officially learns that Wikilabs can not replace us, than wait for another 6 months until they will create their "Governance-Model" and in 2014 we get new hardware. In worst case we wait for 6 months and than WMDE and WMF agree that everything is ok and we will never get any new hardware and somewhen the TS will shut down (of course with the remaining tools that can not migrated to Wikilabs). I can not imagine ways between both cases, but I'm sure they exists. In any way we will get no (or nearly no) new hardware in 2013 – so we have to life with that. A good news is that the toolserver will get 3 new database-servers soon. I have not decided yet if I will remain as root under this circumstances for 2013 – I will tell you my decision until next Sunday. For now I will head to bed because I'm exhausted and disappointed. See you tomorrow. Sincerely, DaB. -- Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885

6 7

S1
by John 03 Dec '12

03 Dec '12

do we have an updated ETA on a non-corrupt s1?

3 5

Multiple jobs started, SGE or cron issue?
by Morten Wang 30 Nov '12

30 Nov '12

I'm currently running a test of one of SuggestBot's scripts and noticed that sometimes there has been two jobs running almost in parallel, which surprised me since I only have a single cron-job. Maybe there's an error in my setup that's causing this, or maybe it's just a glitch in the Matrix? Since I have no idea if it's the former, I'd be happy if someone had any idea about what's been causing this. Here's the crontab entry from the submit server: 28 * * * * cronsub nettasks $HOME/SuggestBot/opentask/opentasks-nettrom.sh So far there's been eight incidents of duplicate jobs: jobnumber 837874, qsub_time Tue Nov 27 21:28:02 2012 jobnumber 837876, qsub_time Tue Nov 27 21:28:03 2012 jobnumber 841731, qsub_time Wed Nov 28 13:28:01 2012 jobnumber 841734, qsub_time Wed Nov 28 13:28:03 2012 jobnumber 844796, qsub_time Thu Nov 29 01:28:01 2012 jobnumber 844797, qsub_time Thu Nov 29 01:28:02 2012 jobnumber 845829, qsub_time Thu Nov 29 05:28:01 2012 jobnumber 845830, qsub_time Thu Nov 29 05:28:01 2012 jobnumber 846093, qsub_time Thu Nov 29 06:28:01 2012 jobnumber 846095, qsub_time Thu Nov 29 06:28:03 2012 jobnumber 846351, qsub_time Thu Nov 29 07:28:01 2012 jobnumber 846354, qsub_time Thu Nov 29 07:28:03 2012 jobnumber 847126, qsub_time Thu Nov 29 10:28:02 2012 jobnumber 847128, qsub_time Thu Nov 29 10:28:03 2012 jobnumber 848150, qsub_time Thu Nov 29 14:28:01 2012 jobnumber 848151, qsub_time Thu Nov 29 14:28:02 2012 Here's the shell script that's launched by cron: nettrom@willow:~$ less SuggestBot/opentask/opentasks-nettrom.sh #!/bin/bash # Name the job "opentasks". #$ -N nettasks # Tell the server we'll be running for a maximum of 55 minutes (default is 6hrs) #$ -l h_rt=00:55:00 # Join STDERR and STDOUT #$ -j y # Store output in a different place. #$ -o $HOME/SuggestBot/logs/opentasks-nettrom.log # Ask for 256MB of memory #$ -l virtual_free=256M # Need 1 SQL process on s1-rr for enwiki, and 1 SQL process on sql-user-n for logging #$ -l sql-s1-rr=1 #$ -l sql-user-n=1 # Until oursql is available on the Linux hosts, we have to restrict this to Solaris, # or rewrite it to use MySQLdb. #$ -l arch=sol # Engage virtualenv source $HOME/env/default/bin/activate # Make sure my local modules work export PYTHONPATH=$HOME/lib/python # Post to my userspace with 5x oversampling, pointing to the right classifier host file python $HOME/SuggestBot/opentask/opentasks.py -o 5 --page="User:Nettrom/sandbox/opentask" -l en \ -f $HOME/SuggestBot/classifier/hostname.txt Regards, Morten

2 3

Postmortem: Fresh import of wikidata at cassia and thyme
by DaB. 30 Nov '12

30 Nov '12

Hello all, I sent my announcement-eMail (see below) to the wrong mail-address yesterday. So here a postmortem: The dumping and importing worked fine, the replication was stopped for ~40min; the replag was cleared short time later. I also imported wikidata at thyme and started replication there too. Sincerely, DaB. At Freitag 30 November 2012E 13:03:12 you wrote: > Hello all, > > cassia's copy of wikidata is defect. Because wikidata is still quite small > and getting a dump from the WMF usually takes some times, I will create > the dump myself from daphne (sql-s4) this time. For this I will stop the > wikidata- replication there > > today, 19:20 UTC. > > All other replication will continue to run. The impact on the performance > of daphne should be small and you should notice no problems. > > Sincerely, > DaB. -- Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885

1 0

Reboot of nightshade and turnera tomorrow
by DaB. 27 Nov '12

27 Nov '12

Hello all, the HA-node turnera needs a reboot quite badly (it is very low on memory). It should be no problem to reboot it (because the other node (damiana) should overtake everything), but you can never be to sure enough. Therefor I hereby announce a downtime-window of 15 minutes for tomorrow Tuesday, 27th, 21:15 UTC. Normally you should notice nothing, but in worst case the toolserver will be down/unavailable until turnera rebooted completely. When the reboot of turnera is done, I will also reboot nightshade which has a very high load since we moved the stuff from thyme to rosemary. This also should not take longer than 15 minutes. Please make sure to have no open files on this box around the reboot. Sincerely, DaB. -- Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885

1 1

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Toolserver-l November 2012