Three identical queries from the 'research_prod' user have just passed one month of execution time on s1-analytics-slave:
select count(*)
from staging.ourvision r
where exists (
    select *
    from staging.ourvision r1
    inner join staging.ourvision r2 on r2.sha1 = r1.sha1
    where r1.page_id = r.page_id
      and r2.page_id = r.page_id
      and DATE_ADD(r.timestamp, INTERVAL 1 HOUR)
      and r2.timestamp between r.timestamp and DATE_SUB(r.timestamp, INTERVAL 1 HOUR)
      and r1.sha1 != r.sha1
);
I haven't checked to see if the queries are just that amazingly slow, or if they're part of a larger ongoing transaction. In any case, three month-long transactions are pushing the resource limits of the slave and will soon result in either mass replication lag or some other interesting lockup that may in turn take days to roll back :-)
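A quick way to check would be a query like this against information_schema (just a sketch; the one-day threshold is arbitrary):

-- transactions open for more than a day, with the connection id to
-- cross-reference against SHOW FULL PROCESSLIST
SELECT trx_mysql_thread_id, trx_started, trx_query
FROM information_schema.INNODB_TRX
WHERE trx_started < NOW() - INTERVAL 1 DAY;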
Can we kill these? Can we optimize and/or redesign the jobs? Happy to help...
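If it helps, one possible direction for a rewrite, only as a sketch: I'm guessing at the intended time windows, and the index is an assumption about a table I haven't inspected, but indexing the join columns and making both timestamp conditions explicit BETWEEN ranges should let the subquery use ranged index lookups instead of scanning the table for every outer row:

ALTER TABLE staging.ourvision
  ADD INDEX page_time_sha1 (page_id, timestamp, sha1);

SELECT COUNT(*)
FROM staging.ourvision r
WHERE EXISTS (
    SELECT 1
    FROM staging.ourvision r1
    JOIN staging.ourvision r2
      ON r2.sha1 = r1.sha1
     AND r2.page_id = r1.page_id
    WHERE r1.page_id = r.page_id
      AND r1.sha1 != r.sha1
      -- assumed intent: r1 in the hour after r, r2 in the hour before r
      AND r1.timestamp BETWEEN r.timestamp AND DATE_ADD(r.timestamp, INTERVAL 1 HOUR)
      AND r2.timestamp BETWEEN DATE_SUB(r.timestamp, INTERVAL 1 HOUR) AND r.timestamp
);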
cc-ing Leila, as we were experimenting with these some weeks back in SF. I think they can be killed w/o problems. I did not know they were still running; we ran a faster version of those queries and got the data we were interested in a while back.
On Mon, Nov 10, 2014 at 1:55 AM, Sean Pringle springle@wikimedia.org wrote:
Let's kill them (Leila is OoO today and tomorrow).
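For the record, killing just the statements from the MySQL console would look roughly like this (the thread id is a placeholder):

-- find the connection ids of the research_prod statements
SHOW FULL PROCESSLIST;
-- kill only the statement; the connection (and any rollback) carries on
KILL QUERY 12345;
-- or drop the whole connection
KILL CONNECTION 12345;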
On Nov 10, 2014, at 08:01, Nuria Ruiz nuria@wikimedia.org wrote:
Sean, what came of your discussion with Coren about limiting time or memory of queries? I think we should totally start enforcing those kinds of limits, as it seems any query running longer than a few days is usually an accident.
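For example (assuming the slaves run a new enough MariaDB or MySQL, which I'm not sure of), a blanket statement timeout would be a one-liner:

-- MariaDB 10.1+ max_statement_time is in seconds (MySQL 5.7+ has the similar
-- max_execution_time, in milliseconds); this aborts anything running longer than 4 hours
SET GLOBAL max_statement_time = 14400;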
On Monday, November 10, 2014, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
On Tue, Nov 11, 2014 at 8:44 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
That discussion (ongoing) pertains to the labsdb replicas. These queries are on s1-analytics-slave, which I think deserves a different, more flexible approach? Certainly the mechanism to kill queries based on rules exists.
Another option might be an icinga alert sent only to you guys? Or perhaps to analytics + otto + gage + me?
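The rule itself could be a simple processlist check, something like the sketch below; the user name comes from this thread and the two-day threshold is arbitrary, so treat it only as an illustration of what a cron job or icinga plugin would wrap:

-- research_prod statements that have been executing for more than two days
SELECT ID, USER, TIME, LEFT(INFO, 100) AS query_head
FROM information_schema.PROCESSLIST
WHERE USER = 'research_prod'
  AND COMMAND = 'Query'
  AND TIME > 2 * 24 * 3600;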
Sean,
What Nuria said. It seems we've missed this one. Sorry for the trouble.
Leila