Hello all,
during the last week I kept an eye on the database servers and killed queries which had run for too long. In the past we had a program for that, but at some point it stopped working and nobody fixed it AFAIK.
Because watching mytop is boring and I have to sleep too, I used the last 2-3 days to write a replacement for the old query killer. This program is working now. It does more or less the same as I did, but it sends you an email when it has killed one of your queries. I think that is an improvement, and I can spend my time on other things.
You can find the parameters for when and whether the killer acts at [1]. Of course, adjustments are always possible. Feel free to open bugs in JIRA or discuss on the mailing list.
Sincerely, DaB.
[1] https://wiki.toolserver.org/view/Database_access#Slow_queries_and_the_query_...
On Sun, Aug 14, 2011 at 5:42 PM, DaB. WP@daniel.baur4.info wrote:
<snip>
So the max run-time is dependent on replag. But, considering this: Replag s3 6050h 10m 44s
I currently get bombarded with killed-query-mails :-(
Magnus
Hello, At Sunday 14 August 2011 20:24:10 DaB. wrote:
So the max run-time is dependent on replag. But, considering this: Replag s3 6050h 10m 44s
I guess you got that replag value from Bryan. I have no idea how he arrived at it (I think he uses an unavailable database), but the replag is MUCH lower (around 1h at the moment). BTW, the email tells you the replag at the time of the kill.
I currently get bombarded with killed-query-mails
I'm sorry about that. But when I look in the log, I see either CREATE, INSERT or DELETE queries of yours without SLOW_OK (a bad idea) or SELECTs which had run for over 30 minutes (by then the user is long gone). So please add SLOW_OK to the CREATE, INSERT, UPDATE, REPLACE etc. queries and a reasonable LIMIT: to the SELECTs (and try to optimize them), and the mail flood will ebb. If you run SELECTs which will take long (from the command line, for example, or from SGE), please add a SLOW_OK as well, but don't do this for web queries!
Sincerely, DaB.
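For tool authors who build their SQL in code, the advice above could be applied with a small helper like the following. This is only a sketch: the helper name `add_killer_hints` is made up for illustration, the `/* SLOW_OK */` comment is the form quoted later in this thread, and both the exact spelling of the `LIMIT:` comment and the placement after the first keyword are assumptions that should be checked against the wiki page [1].

```python
from typing import Optional


def add_killer_hints(sql: str, slow_ok: bool = False,
                     limit_seconds: Optional[int] = None) -> str:
    """Insert query-killer hint comments after the first SQL keyword.

    Hypothetical helper: the hint spelling and placement are assumptions,
    not taken from the query killer's documentation.
    """
    hints = []
    if slow_ok:
        hints.append("/* SLOW_OK */")
    if limit_seconds is not None:
        # Assumed form of the LIMIT: hint (maximum runtime in seconds).
        hints.append(f"/* LIMIT:{limit_seconds} */")
    if not hints:
        return sql
    # Split off the leading keyword (SELECT, INSERT, ...) and put the
    # hints right behind it.
    verb, _, rest = sql.strip().partition(" ")
    return " ".join([verb] + hints + [rest]).rstrip()


# A web-facing SELECT gets a runtime limit but no SLOW_OK.
print(add_killer_hints("SELECT page_title FROM page", limit_seconds=60))
# A batch write that is expected to take long gets SLOW_OK.
print(add_killer_hints("INSERT INTO t VALUES (1)", slow_ok=True))
```

Keeping the hints in one place like this makes it harder to add SLOW_OK to a web query by accident.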
On Sun, Aug 14, 2011 at 8:33 PM, DaB. WP@daniel.baur4.info wrote:
<snip>
I was using eswiki, but apparently that got split off to s7 months ago. Does anybody have a good suggestion for which wiki to use for s3? dawiki is the largest Wikipedia on it, but I think it will have very few edits during the night in Europe. I've currently set it to simplewiki, but that isn't a very active one either.
Bryan
At Sunday 14 August 2011 21:10:26 DaB. wrote:
Does anybody have a good suggestion for which wiki to use for s3? dawiki is the largest Wikipedia on it, but I think it will have very few edits during the night in Europe.
I use dawiki, and the highest replag in the last 24h was 1978s. That should be more or less correct.
Sincerely, DaB.
On Sun, Aug 14, 2011 at 8:12 PM, DaB. WP@daniel.baur4.info wrote:
<snip>
Update: I seem to be getting the vast majority of mails about "z-dat-s4-a". Replag is <1000s but it seems strange that that one shows up so predominantly.
Quick question: is "Unable to run job: got no response from JSV script "/sge62/default/common/jsv.sh"" the killer? Because I got a lot of them: I came home after about 5 hours to about 8 of them, and my API queries are quick... I don't get what I'm doing wrong (unless it's the server).
--DQ
On Mon, Aug 15, 2011 at 1:30 AM, DeltaQuad Wikipedia deltaquadwiki@gmail.com wrote:
<snip>
It's not 'the killer', but it might be related, because I got those as well, first time ever:
Unable to run job: got no response from JSV script "/sge62/default/common/jsv.sh".
Magnus
Is this script available somewhere? I would also like to use it outside the Toolserver.
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
On 14-8-2011 18:42, DaB. wrote:
<snip>
[1] https://wiki.toolserver.org/view/Database_access#Slow_queries_and_the_query_...
Ah, you are the one who killed the categorization bot. Thanks for announcing that beforehand (not!). It's very annoying that you suddenly just decide to deploy a new toy that screws up our tools.
Maarten
On Tue, Aug 16, 2011 at 7:03 AM, Maarten Dammers maarten@mdammers.nl wrote:
<snip>
It was re-enabling a service that has been around for a long time, which [accidentally] broke at some stage; because TS performance was not degraded, this was never really noticed or cared about.
Recently a TS box went down and caused major issues (replag was easily over 2 days at one stage for those databases) for the backup box handling the affected databases. DaB was monitoring this by hand and manually killing processes to keep things somewhat saner, then rewrote the tool and brought it back online to handle this once they noticed that it [the tool] was not functioning.
This was a tool that was already meant to be running but by [your] "luck" wasn't, and it was already documented on the wiki over a year ago, as DaB pointed out.
Perhaps you should consider apologizing to DaB with regard to the tone of your email.
On Mon, Aug 15, 2011 at 11:39 PM, K. Peachey p858snake@gmail.com wrote:
<snip>
While it's wrong to insult DaB (or anyone here :-) I feel the annoyance. It doesn't really matter if it was "supposed to" be running but wasn't for some reason we mere mortals can't comprehend: queries were fine, and then suddenly they were not. I understand there are reasons behind this (hardware failure, miser mode), but they weren't obvious from the original mail.
So now I (and others, I suspect) am inserting /* SLOW_OK */ into queries. That doesn't make the queries fewer or faster, but it returns service to normal (for the tool writers, and the tool users). German speakers might be familiar with the phrase "wenn's denn der Wahrheitsfindung dient..." (roughly: "if it serves to establish the truth...").
Should we (the tool authors) make our queries faster? Absolutely. Can we? Sometimes, maybe. Usually not. That might be due to personal qualification, or system constraints. The local version of the "work smarter, not harder" mantra is unlikely to improve the situation.
Cheers, Magnus
Hello, At Tuesday 16 August 2011 23:12:09 DaB. wrote:
So now I (and others, I suspect) am inserting /* SLOW_OK */ into queries. That doesn't make the queries fewer or faster, but it returns service to normal (for the tool writers, and the tool users). German speakers might be familiar with the phrase "wenn's denn der Wahrheitsfindung dient..."
Of course I considered, before I rewrote the killer, that users would start inserting SLOW_OK just everywhere. Let me tell you why that is not as bad as it seems: Most users will not do this because their queries already execute fast enough (and so the killer ignores them). Some users will add a SLOW_OK but set a LIMIT: too (that is OK, because the idea behind SLOW_OK is that users say "I know that this will run longer"). Some users will just add a SLOW_OK because they know that the queries will run for longer (that's OK too). Only very few users will add SLOW_OK to queries which could run fast, and these few queries I can kill manually and then write angry mails to the users.
The query killer should kill 3 types of queries:
- Queries which have a LIMIT: and have run too long.
- Queries which normally run very fast, but for some reason not this time (like an edit counter which works fast for people with low and medium edit counts, but fails for a bot with 100,000 edits).
- Queries of users who do not think.
If you know that a query will run for longer (because it is complex, because it will return many rows, or something else), then add SLOW_OK (and a LIMIT: if you can) and everything is fine! If you know that the query should run fast, then please set a LIMIT: and look into it if you get too many emails, because then something is wrong.
I guess most of the time (when the replag is low) most users will get no emails at all (1h is a long time).
Sincerely, DaB.
P.S.: And yes, I can understand if somebody is angry when his/her query is killed. But please try to understand my side too.
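The rules described in this message could be sketched roughly as follows. This is not the actual killer: the function names are made up, the thresholds are hypothetical placeholders (the real parameters are on the wiki page [1]), and it assumes that an explicit LIMIT: always wins, that a bare SLOW_OK query is never killed automatically, and that the default allowance shrinks when replag is high.

```python
import re
from typing import Optional

# Assumed spelling of the LIMIT: hint (maximum runtime in seconds).
LIMIT_HINT = re.compile(r"/\*\s*LIMIT:\s*(\d+)\s*\*/")

# Hypothetical thresholds, NOT the real killer parameters.
LOW_REPLAG_SECONDS = 600       # below this, replag counts as "low"
MAX_RUNTIME_LOW_REPLAG = 3600  # "1h is a long time"
MAX_RUNTIME_HIGH_REPLAG = 600


def allowed_seconds(sql: str, replag_seconds: int) -> Optional[int]:
    """Return the maximum runtime, or None if the query is never auto-killed."""
    match = LIMIT_HINT.search(sql)
    if match:
        # An explicit LIMIT: applies even when SLOW_OK is also present.
        return int(match.group(1))
    if "/* SLOW_OK */" in sql:
        # Assumption: SLOW_OK without LIMIT: is left to manual review.
        return None
    if replag_seconds < LOW_REPLAG_SECONDS:
        return MAX_RUNTIME_LOW_REPLAG
    return MAX_RUNTIME_HIGH_REPLAG


def should_kill(sql: str, runtime_seconds: int, replag_seconds: int) -> bool:
    """Decide whether a running query has exceeded its allowance."""
    limit = allowed_seconds(sql, replag_seconds)
    return limit is not None and runtime_seconds > limit


# An unhinted query that has run for 30 minutes survives while replag
# is low but is killed once replag is high.
print(should_kill("SELECT 1", 1800, replag_seconds=0))
print(should_kill("SELECT 1", 1800, replag_seconds=4000))
```

Under these assumptions, a query that normally runs fast only triggers a kill mail when it unexpectedly exceeds its own LIMIT: or the replag-dependent default.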