Is it possible? What's the process for it?
Some of the tables on db1047 I'm dealing with are getting massive (the main offender is 27.8GB) and some of our daily queries used to build tsvs are badly in need of indexes.
The SQL credentials I have access to don't have the rights to alter those tables.
Hi Gilles,
We don't have a documented process yet. I'd be happy to be the contact for adding indexes since I need indexes so often myself. I've asked springle if this would be acceptable.
In the meantime, you could file a ticket in rt requesting the index.
-Aaron
Thanks Aaron,
I don't have access to rt itself; I've been sending requests by email before. Should this go to ops-requests@rt.wikimedia.org or to another rt address? I was unable to find a list of those rt email addresses, which I presume route requests to the right team/project in rt.
On Mon, May 12, 2014 at 11:40 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Hi Gilles, Aaron
Actually, I've started collecting data so we can analyze the analysts... or at least your SQL ;-)
https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1047&user=&schema=&qmode=eq&query=&hours=12
(only accessible to WMF staff)
Which ones are your tsvs stuff?
As you no doubt know, adding indexes to an RDBMS arbitrarily will eventually cause the system to struggle as overhead for writes increases. Also, while indexing is necessary and definitely acceptable, it shouldn't be the only port of call when performance tuning. So, I'd like to approach this carefully:
1. For the day-to-day queries run by your tools, like most of those queries on the URL above I expect, I'll begin adding some appropriate indexes, and also follow up with you guys if queries can be redesigned to be more efficient.
First example awaiting feedback: https://gerrit.wikimedia.org/r/#/c/131929/
2. For the ad-hoc or massive reporting queries like those Aaron runs from time to time, we give the halfak user DDL permissions for the "log" schema just as Aaron suggested, and let him spearhead the effort because I won't have a hope of predicting what you'll need.
However, regarding #2: We need to tread carefully because DDL for index operations is not necessarily something that can run any old time. Plain old ALTER TABLE will cause replication lag and table metadata locks, and I suggest whoever handles it should become familiar with Percona Toolkit. (Aaron? still keen?)
3. For queries that need indexes on wiki schemas, I think they will always need to be checked by Ops or a MW core dev. Obviously you guys could only affect yourselves, but I bet you wouldn't appreciate having db1047 broken by an incompatible upstream change.
What do you think?
BR Sean
Which ones are your tsvs stuff?
The ones I'm trying to improve are querying MediaViewer_8245578 (just grep for that on the page you've linked to, there will be some things I ran manually today, but the big queries are the problematic ones). The existing queries are full table scans and seem to take around 15 minutes. We run 27 of these every day one after the other. The index I want to add will actually help me break those queries down into several smaller queries. There's no point trying to break them down right now, because I expect that without indexes the simpler queries will take just as long because they'll be full table scans as well.
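The effect described here, a full table scan turning into an index lookup, can be sketched with a small self-contained demo. SQLite stands in for the MariaDB instance on db1047; the table name matches the thread, but the data and index name are synthetic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MediaViewer_8245578 (timestamp TEXT, wiki TEXT, event_action TEXT)"
)
conn.executemany(
    "INSERT INTO MediaViewer_8245578 VALUES (?, ?, ?)",
    [
        ("201405%02d000000" % (i % 28 + 1), "cawiki" if i % 2 else "enwiki", "click")
        for i in range(1000)
    ],
)

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite will scan the table
    # or search it via an index.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT COUNT(*) FROM MediaViewer_8245578 WHERE wiki = 'cawiki'"
before = plan(query)  # no index yet: a full table scan

conn.execute(
    "CREATE INDEX ix_mv_wiki_timestamp ON MediaViewer_8245578 (wiki, timestamp)"
)
after = plan(query)  # the same query is now resolved through the index

print(before)
print(after)
```

On the 120M+ row tables in the thread, the same plan change is what turns a ~15 minute scan into a quick range read.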
Ok, think I see the right ones.
If you want to avoid breaking the queries up, push the conditions down onto each base table of the union:
SELECT ...
FROM (
    SELECT timestamp, wiki, event_action
    FROM MediaViewer_7670440
    WHERE wiki = 'cawiki'
      AND timestamp <  TIMESTAMP(CURDATE())
      AND timestamp >= TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 30 DAY))
    UNION ALL
    SELECT timestamp, wiki, event_action
    FROM MediaViewer_8245578
    WHERE wiki = 'cawiki'
      AND timestamp <  TIMESTAMP(CURDATE())
      AND timestamp >= TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 30 DAY))
) AS MediaViewerUnioned
GROUP BY datestring
ORDER BY datestring ASC;
That will allow a range access on the base tables' timestamp indexes which should reduce the row count a bit. Then if we add an index on (wiki, timestamp), which I'll kick off now, it drops the execution time to ~1min.
UNION ALL avoids the sorting overhead of UNION DISTINCT but unfortunately still materializes the entire set into a temporary table. Pushing the conditions down makes the query more verbose, but perhaps that's still easier than maintaining multiple queries and emulating union.
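The pushdown rewrite can be checked mechanically: with the conditions inside each UNION ALL branch, the planner gets an index range scan per branch instead of a scan of the materialized union. A minimal SQLite sketch (the datestring expression, date literals, and index names are illustrative, not the production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("MediaViewer_7670440", "MediaViewer_8245578"):
    conn.execute(f"CREATE TABLE {t} (timestamp TEXT, wiki TEXT, event_action TEXT)")
    conn.execute(f"CREATE INDEX ix_{t}_wiki_ts ON {t} (wiki, timestamp)")

pushed_down = """
SELECT substr(timestamp, 1, 8) AS datestring, COUNT(*)
FROM (
    SELECT timestamp, wiki, event_action FROM MediaViewer_7670440
    WHERE wiki = 'cawiki' AND timestamp >= '20140412000000'
                          AND timestamp <  '20140512000000'
    UNION ALL
    SELECT timestamp, wiki, event_action FROM MediaViewer_8245578
    WHERE wiki = 'cawiki' AND timestamp >= '20140412000000'
                          AND timestamp <  '20140512000000'
) AS MediaViewerUnioned
GROUP BY datestring
ORDER BY datestring ASC
"""
# Each UNION ALL branch should report an index search, not a table scan.
details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + pushed_down)]
for d in details:
    print(d)
```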
Sean
Thanks for adding that index, that's exactly what I needed. I'll update the queries to what you suggested now. I actually looked into doing exactly that change earlier today, but couldn't get it to hit the timestamp index. I wasn't specifying a value for "wiki" (it's one of the queries we run, the "global" counts), and it wasn't hitting the index, probably because of NULL values. Now I can force it to hit that new index, so everything should be back to sane execution time for those tables.
It would be awesome if you could add a similar (wiki, timestamp) index for exactly the same reasons to all the MultimediaViewerNetworkPerformance* tables on the same database. Those tables haven't been problematic yet because they're a lot smaller, but we might as well make all Multimedia EventLogging queries faster in one swoop.
OK. Will do.
This is in progress, but is happening online to avoid replag so will probably take a couple days.
I set pt-online-schema-change to simply index (wiki,timestamp) on all large log tables. There are other queries in the slow log that look like they might benefit.
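For reference, the kind of pt-online-schema-change invocation described here looks roughly like the following. The DSN fields, index name, and option values are illustrative, not copied from the actual run; the tool copies the table in the background and throttles itself when replica lag grows:

```shell
# Sketch of an online index build with Percona Toolkit.
# h= host, D= database, t= table; values here are assumptions.
pt-online-schema-change \
  --alter "ADD INDEX ix_wiki_timestamp (wiki, timestamp)" \
  --max-lag 5 \
  --dry-run \
  h=db1047,D=log,t=MediaViewer_8245578

# Re-run with --execute instead of --dry-run to perform the change.
```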
I noticed the IS NOT NULL version appear. Forcing the index in that case may not help; I suspect not-null will just cause an index scan and double the overhead compared to the table scan (because the secondary index isn't clustered).
However, the log tables are now using TokuDB instead of InnoDB, so it should be possible to have multiple clustered indexes. Will experiment with it more tomorrow and report back.
So what's the best thing to do for now? Not adding "IS NOT NULL" and letting it do a full table scan? That global query is indeed the last one that remains noticeably slow (it's running over 120+ million rows).
There are zero NULL values for "wiki" in those tables, so forcing the secondary index won't help and will be slightly slower than a table scan on the clustered primary key (extra level of indirection for each row touched).
I suggest removing the FORCE INDEX clauses entirely from both forms of the query. Just keep the conditions pushed down to the unioned tables for now.
When filtering by a wiki name, the ix_*_wiki_timestamp indexes are chosen regardless.
When not filtering by wiki, the optimizer should be free to choose the ix_*_timestamp index if possible (i.e., if the tables ever hold a lot of data older than the 30 day range).
(Aaron? still keen?)
Totally. :). I'm down for some query performance review too. I'm already doing that informally.
For queries that need indexes on wiki schemas, I think they will always
need to be checked by Ops or a MW core dev.
Sounds reasonable. Are you imagining a gerrit style code review?
Great! I therefore added you to a pending gerrit review ;-)
Are you imagining a gerrit style code review?
Yes, that would work.
Also need to consider the analytics-store box. Its indexes probably should stay in sync with s1-analytics-slave, especially once it gets scratch space.
Sean,
tendril is really awesome. I too would love to review the performance of some queries used for the EE dashboards. One in particular [1] used to be fairly fast and is now taking an ugly lot of time to complete, possibly due to some schema change I was unaware of.
I’ll drop you a line off-thread to discuss this, but I am very glad we have this tool to monitor performance.
Dario
[1] checksum: 356942383120bc50242e605e0c40fe81