So, is anyone else seeing really horrible mysql performance on s1? I heard a mysql upgrade introduced regressions, but I'm seeing queries take 2x - 3x as long, sometimes worse.
- Jason
Hi,
I noticed unusually high replag on s1 in the past week; that's probably related.
But I didn't hear about any recent upgrades. They are usually announced, because they cause some downtime.
Petr Onderka [[en:User:Svick]]
On Wed, Oct 26, 2011 at 9:50 AM, Ja Ga jaga_x_1@yahoo.com wrote:
So, is anyone else seeing really horrible mysql performance on s1? I heard a mysql upgrade introduced regressions, but I'm seeing queries take 2x - 3x as long, sometimes worse.
Yes. On the 25th, the following query of mine was killed after 1700 seconds:
SELECT ns_id, ns_name FROM toolserver.namespacename where (ns_type = 'canonical' or ns_type = 'primary') and dbname = 'enwiki_p'
That should be an extremely fast query; there is an index on dbname. I reported similar problems on this list a little while ago, and I get many other notices about killed queries that should not be problematic.
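A quick way to check whether that index is actually being used (a sketch; the exact output depends on the server and its table statistics):
-- If the index on dbname is chosen, the "key" column of the EXPLAIN output
-- should name it and the "rows" estimate should be small.
EXPLAIN SELECT ns_id, ns_name
FROM toolserver.namespacename
WHERE (ns_type = 'canonical' OR ns_type = 'primary')
  AND dbname = 'enwiki_p';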
- Carl
On Wed, Oct 26, 2011 at 6:29 PM, Carl (CBM) cbm.wikipedia@gmail.com wrote:
the following query of mine was killed after 1700 seconds:
SELECT ns_id, ns_name FROM toolserver.namespacename where (ns_type = 'canonical' or ns_type = 'primary') and dbname = 'enwiki_p'
That should be an extremely fast query; there is an index on dbname.
Just a little side note: you may be interested in ns_is_favorite. There is one entry per namespace per dbname where `ns_is_favorite = 1`, which is also the one used by the wiki when creating/redirecting native links.
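If ns_is_favorite behaves as described, the lookup above could be written as follows (a sketch only; I have not checked this against the live table):
-- One row per namespace per dbname, using the wiki's preferred name.
SELECT ns_id, ns_name
FROM toolserver.namespacename
WHERE dbname = 'enwiki_p'
  AND ns_is_favorite = 1;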
- Krinkle
Hello,
I am not sure if I can be of much help yet, since I am new to the environment, but I have some strange findings in the monitoring graphs.
First, I'd be interested in which MySQL host you are asking about. Then, this is the graph: http://munin.toolserver.org/Database/rosemary/mysql_bytes.html Rosemary (one of the enwiki DB hosts) seems to be hitting the maximum possible I/O; the disk graphs are clipping there. In the MySQL traffic graph you can see there is clipping too. The strange thing is that this phenomenon started in the middle of September. Can anyone remember any important change around that time? That is not the case with thyme (the other enwiki DB host), so the question is whether we could share the load better. Also, I enabled query caching and will see if this is useful in any way.
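To see whether the query cache actually helps, the hit and prune counters are worth watching (a sketch; these counters exist in MySQL 5.x, but whether the cache is a net win here is an open question):
-- Current query cache configuration.
SHOW VARIABLES LIKE 'query_cache%';
-- Qcache_hits vs. Qcache_inserts shows how often cached results are reused;
-- a fast-growing Qcache_lowmem_prunes suggests the cache is too small.
SHOW GLOBAL STATUS LIKE 'Qcache%';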
Another question would be: which queries are especially slow? Do you have special trouble with joins?
Cheers nosy
Marlen Caemmerer wrote:
Rosemary (one of the enwiki DB hosts) seems to be hitting the maximum possible I/O; the disk graphs are clipping there. In the MySQL traffic graph you can see there is clipping too. The strange thing is that this phenomenon started in the middle of September. Can anyone remember any important change around that time? That is not the case with thyme (the other enwiki DB host), so the question is whether we could share the load better. Also, I enabled query caching and will see if this is useful in any way.
Maybe there's some job/tool that started around that time and is loading rosemary so much?
On Wed, Oct 26, 2011 at 10:18 PM, Platonides platonides@gmail.com wrote:
Marlen Caemmerer wrote:
Rosemary (one of the enwiki DB hosts) seems to be hitting the maximum possible I/O; the disk graphs are clipping there. In the MySQL traffic graph you can see there is clipping too. The strange thing is that this phenomenon started in the middle of September. Can anyone remember any important change around that time? That is not the case with thyme (the other enwiki DB host), so the question is whether we could share the load better. Also, I enabled query caching and will see if this is useful in any way.
Maybe there's some job/tool that started around that time and is loading rosemary so much?
In the last three hours, I got no fewer than 21 immensely useless mails from the query killer. The following query, apparently, was killed after running for a whopping 71 minutes:
SELECT /* SLOW_OK */ /* GLAMOROUS */ gil_wiki,gil_page_title,gil_page_namespace,gil_to from globalimagelinks,page,categorylinks where gil_to=page_title and cl_to ="Images_from_the_National_Archives_and_Records_Administration" AND page_id=cl_from AND page_namespace=6 AND gil_page_namespace=""
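For reference, the same query with explicit JOIN syntax (semantically equivalent to the comma-join form above, so this is a readability sketch, not a fix):
SELECT /* SLOW_OK */ /* GLAMOROUS */ gil_wiki, gil_page_title, gil_page_namespace, gil_to
FROM globalimagelinks
JOIN page ON gil_to = page_title
JOIN categorylinks ON cl_from = page_id
WHERE cl_to = 'Images_from_the_National_Archives_and_Records_Administration'
  AND page_namespace = 6
  AND gil_page_namespace = '';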
These queries, and others like them, have not caused problems for the last one (two? I can't remember) years they've been in place. Something has changed for the far, far worse.
Magnus
Hi,
Am 29.10.2011, 13:25 Uhr, schrieb Magnus Manske magnusmanske@googlemail.com:
In the last three hours, I got no fewer than 21 immensely useless mails from the query killer. The following query, apparently, was killed after running for a whopping 71 minutes:
These queries, and others like them, have not caused problems for the last one (two? I can't remember) years they've been in place. Something has changed for the far, far worse.
It's the same for me. I didn't change anything, but most queries are much slower than before (I got the first mails from the query killer at around 4 pm UTC yesterday).
Chris / apper
Hello, At Saturday 29 October 2011 15:04:30 DaB. wrote:
These queries, and others like them, have not caused problems for the last one (two? I can't remember) years they've been in place. Something has changed for the far, far worse.
have you ever thought about the possibility that maybe there are more users on the TS and more people who use the toolserver now than 2 years ago? Or maybe it is just people like you, who let a query for a WEBTOOL run for 71 minutes!
We are just short on hardware at the moment, and the situation will not change for some time, so we all have to live with what we have; and if a tool breaks because it consumes too many resources, then the developer has to see whether he/she can optimize it, or it breaks!
Sincerely, DaB.
DaB. schrieb:
have you ever thought about the possibility that maybe there are more users on the TS and more people who use the toolserver now than 2 years ago? Or maybe it is just people like you, who let a query for a WEBTOOL run for 71 minutes!
We are just short on hardware at the moment, and the situation will not change for some time, so we all have to live with what we have; and if a tool breaks because it consumes too many resources, then the developer has to see whether he/she can optimize it, or it breaks!
Sincerely, DaB.
Something has changed very recently. I had occasionally received a cron error of
Unable to run job: got no response from JSV script "/sge62/default/common/jsv.sh".
But since ~24 hours ago I have gotten many of those. I don't know what may be wrong (is the target server too slow to answer the RPC? a flood of jobs migrated to SGE?), but I'm a bit concerned that we have redundant servers to ensure the crons registered with SGE are run, yet the registration can fail for pretty much no reason. Maybe we should start launching locally if cronsub fails.
Hello, At Saturday 29 October 2011 16:24:30 DaB. wrote:
Something has changed very recently. I had occasionally received a cron error of
Unable to run job: got no response from JSV script "/sge62/default/common/jsv.sh".
We are trying to find the reason for this at the moment.
Sincerely, DaB.
On Sat, Oct 29, 2011 at 2:18 PM, DaB. WP@daniel.baur4.info wrote:
Hello, At Saturday 29 October 2011 15:04:30 DaB. wrote:
These queries, and others like them, have not caused problems for the last one (two? I can't remember) years they've been in place. Something has changed for the far, far worse.
have you ever thought about the possibility that maybe there are more users on the TS and more people who use the toolserver now than 2 years ago?
Unless there was an enormous influx of users in the last 12 hours or so, this is not really an explanation.
Or maybe it is just people like you, who let a query for a WEBTOOL run for 71 minutes!
If you look at the rather simple query, you'll see that it's not so much that my tool needs 71 minutes to run. The issue is rather that the database server takes 71 minutes for a query that shouldn't even take 71 seconds.
We are just short on hardware at the moment, and the situation will not change for some time, so we all have to live with what we have; and if a tool breaks because it consumes too many resources, then the developer has to see whether he/she can optimize it, or it breaks!
The "work smarter, not harder" approach is neither helpful nor winning many sympathies.
I am aware that the toolserver needs both hardware and admin attention. I was not complaining about a lack of either. I was merely commenting on a very recent change for the worse, which might have an unrelated cause (a disk filled up, some configuration changed), and might thus have an easy fix. Yelling at people who mention a new and acute issue is not the way forward.
Magnus
On Wed, Oct 26, 2011 at 9:50 AM, Ja Ga jaga_x_1@yahoo.com wrote:
So, is anyone else seeing really horrible mysql performance on s1? I heard a mysql upgrade introduced regressions, but I'm seeing queries take 2x - 3x as long, sometimes worse.
https://wiki.toolserver.org/view/Mailing_list_etiquette
Another problem that has started to appear very recently is that sometimes s1 (or sql-s1-user at least) seems to disappear. I got the following error from a perl script on willow in the past 8 hours:
DBI connect('database=p_enwp10:host=sql-s1-user:mysql_read_default_file=/home/project/e/n/w/enwp10/.my.cnf','',...) failed: Unknown MySQL server host 'sql-s1-user' (0) at database_routines.pl line 627
Couldn't connect to database: Unknown MySQL server host 'sql-s1-user' (0) at database_routines.pl line 627.
- Carl
One more DB performance report/question. I am seeing some UPDATE queries that only change one row but take much, much longer than they ought to. Is anyone else seeing this?
For example, the following type of query is getting killed from time to time in the p_enwp10 database on sql-s1. The query killer says it ran for over 400 seconds before being killed. The update is on a primary key, and I don't see any way to optimize it. At the time this runs, the database connection is inside a transaction (AutoCommit = 0), if that matters.
UPDATE tmpcategories SET c_category = 'A-Class_Water_supply_and_sanitation_articles', c_ranking = '425', c_replacement = 'A-Class' WHERE c_project = 'Water_supply_and_sanitation' and c_rating= 'A-Class' and c_type = 'quality'
- Carl
mysql> show indexes from tmpcategories;
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table         | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| tmpcategories |          0 | PRIMARY  |            1 | c_project   | A         |        4055 |     NULL | NULL   |      | BTREE      |         |
| tmpcategories |          0 | PRIMARY  |            2 | c_type      | A         |        8110 |     NULL | NULL   |      | BTREE      |         |
| tmpcategories |          0 | PRIMARY  |            3 | c_rating    | A         |       40553 |     NULL | NULL   |      | BTREE      |         |
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
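Since the primary key covers exactly the columns in the WHERE clause, one plausible (but unverified) explanation is that the UPDATE is waiting on locks held by another transaction rather than doing any real work. A quick check while such a statement is stuck (a sketch; it assumes the table is InnoDB, and SHOW ENGINE INNODB STATUS needs the PROCESS privilege, which ordinary accounts may not have):
-- Look for long-running threads and their state ('Locked', 'Updating', ...).
SHOW FULL PROCESSLIST;
-- The TRANSACTIONS section lists transactions waiting for row locks.
SHOW ENGINE INNODB STATUS;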
On 23/11/11 15:16, Carl (CBM) wrote:
One more DB performance report/question. I am seeing some UPDATE queries that only change one row but take much, much longer than they ought to. Is anyone else seeing this?
For example, the following type of query is getting killed from time to time in the p_enwp10 database on sql-s1. The query killer says it ran for over 400 seconds before being killed. The update is on a primary key, and I don't see any way to optimize it. At the time this runs, the database connection is inside a transaction (AutoCommit = 0), if that matters.
UPDATE tmpcategories SET c_category = 'A-Class_Water_supply_and_sanitation_articles', c_ranking = '425', c_replacement = 'A-Class' WHERE c_project = 'Water_supply_and_sanitation' and c_rating= 'A-Class' and c_type = 'quality'
- Carl
I guess you could make the query shorter by splitting it up with a LIMIT clause, then re-executing while the affected rows = limit. (Assuming you don't need the UPDATE to be atomic.)
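A minimal sketch of what I mean (the LIMIT value is arbitrary; the application would re-issue the statement until it reports fewer affected rows than the limit):
-- Re-run from the application while the affected-row count equals 1000.
UPDATE tmpcategories
SET c_category = 'A-Class_Water_supply_and_sanitation_articles',
    c_ranking = '425',
    c_replacement = 'A-Class'
WHERE c_project = 'Water_supply_and_sanitation'
  AND c_rating = 'A-Class'
  AND c_type = 'quality'
LIMIT 1000;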
On Wed, Nov 23, 2011 at 6:13 PM, Platonides platonides@gmail.com wrote:
UPDATE tmpcategories SET c_category = 'A-Class_Water_supply_and_sanitation_articles', c_ranking = '425', c_replacement = 'A-Class' WHERE c_project = 'Water_supply_and_sanitation' and c_rating= 'A-Class' and c_type = 'quality'
I guess you could make the query shorter by splitting it up with a LIMIT clause, then re-executing while the affected rows = limit. (Assuming you don't need the UPDATE to be atomic.)
The WHERE clause completely specifies a primary key, so at most one row is affected. That's why it is surprising to me that it would take very long.
- Carl