Another SQL optimization request

List overview All Threads
Download

newer

older

excessive lagging

[Cloud-announce] Dropping user...

Huji Lee

4 Jun 2019 4 Jun '19

10:53 p.m.

Hi all, I have a query that used to do a GROUP BY on rev_user and now I have migrated it to do so using rev_actor instead. The latest version looks like this <https://pastebin.com/jWTzsiJY> and its performance is not good <https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++actor_name%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+0+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+article%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+10+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+tpl%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+12+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+helppage%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+14+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+cat%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+100+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+portal%2C%0D%0A++COUNT%28rev_first%29+tot%0D%0AFROM+revision+r%0D%0AJOIN+%28%0D%0A++select%0D%0A++++MIN%28rev_id%29+rev_first%2C%0D%0A++++rev_page%0D%0A++FROM+revision%0D%0A++GROUP+BY+rev_page%0D%0A%29+f%0D%0A++ON+r.rev_id+%3D+f.rev_first%0D%0AJOIN+page%0D%0A++ON+page_id+%3D+r.rev_page%0D%0AJOIN+actor%0D%0A++ON+rev_actor+%3D+actor_id%0D%0ALEFT+JOIN+user_groups%0D%0A++ON+actor_user+%3D+ug_user%0D%0A++AND+ug_group+%3D+%22bot%22%0D%0AWHERE%0D%0A++actor_user+%3C%3E+0%0D%0A++AND+ug_group+IS+NULL%0D%0A++AND+page_namespace+IN+%28%0D%0A++++0%2C%0D%0A++++10%2C%0D%0A++++12%2C%0D%0A++++14%2C%0D%0A++++100%0D%0A++%29%0D%0AGROUP+BY+rev_actor%0D%0AORDER+BY+tot+desc%0D%0ALIMIT+300> at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query. Is this a side effect of the revision_actor_temp table? Also, what is that "key0" about, which is not associated with any tables? Finally, any advice on how to make it more efficient is highly appreciated. Thanks! Huji

Attachments:

attachment.htm (text/html — 3.0 KB)

Show replies by date

Bryan Davis

5 Jun 5 Jun

12:20 a.m.

On Tue, Jun 4, 2019 at 4:54 PM Huji Lee <huji.huji(a)gmail.com> wrote:

...

Hi all, I have a query that used to do a GROUP BY on rev_user and now I have migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query. Is this a side effect of the revision_actor_temp table?

The EXPLAIN output including filearchive, image, oldimage, and other apparently unrelated fields is caused by the view for the actor table. This is basically the same problem as I described in <https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html> for the view of the comment table. The flag for suppression lives in the related entity table, so the actor view runs correlated sub-queries for each entity table that could indicate that a specific row from the actor table has been suppressed. I created <https://phabricator.wikimedia.org/T224850> to look for a solution for this same class of issue in the comment table. If we find a fix there it seems likely that we could do the same thing for the actor table, whatever that may end up being.

...

Also, what is that "key0" about, which is not associated with any tables?

This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See <https://mariadb.com/kb/en/library/derived-table-with-key-optimization/> for some details on this type of internal optimization.

...

Finally, any advice on how to make it more efficient is highly appreciated.

The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database. I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.

...

[0]: https://pastebin.com/jWTzsiJY [1]: https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++…

Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855

Huji Lee

12:41 a.m.

All great answers; thank you! Let's wait and see what comes out of T224850. On Tue, Jun 4, 2019 at 8:21 PM Bryan Davis <bd808(a)wikimedia.org> wrote:

...

On Tue, Jun 4, 2019 at 4:54 PM Huji Lee <huji.huji(a)gmail.com> wrote:

Hi all, I have a query that used to do a GROUP BY on rev_user and now I have

migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.

Is this a side effect of the revision_actor_temp table?

Also, what is that "key0" about, which is not associated with any tables?

This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See < https://mariadb.com/kb/en/library/derived-table-with-key-optimization/> for some details on this type of internal optimization.

Finally, any advice on how to make it more efficient is highly

appreciated. The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database. I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.

[0]: https://pastebin.com/jWTzsiJY [1]:

https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++… Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855 _______________________________________________ Wikimedia Cloud Services mailing list Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

Jaime Crespo

8:30 a.m.

...

Finally, any advice on how to make it more efficient is highly appreciated.

I would be more concerned about the derived subquery, as that can create many executions of the inside query for every row of the outside one. If you could update the query with a comprehensive comment of what you are trying to achieve, that would help optimizing it, because some times transformation requires a logically different query but one that obtains equal or similar results. On Wed, Jun 5, 2019 at 2:21 AM Bryan Davis <bd808(a)wikimedia.org> wrote: > > On Tue, Jun 4, 2019 at 4:54 PM Huji Lee <huji.huji(a)gmail.com> wrote: > > > > Hi all, > > > > I have a query that used to do a GROUP BY on rev_user and now I have migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query. > > > > Is this a side effect of the revision_actor_temp table? > > The EXPLAIN output including filearchive, image, oldimage, and other > apparently unrelated fields is caused by the view for the actor table. > This is basically the same problem as I described in > <https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html> > for the view of the comment table. The flag for suppression lives in > the related entity table, so the actor view runs correlated > sub-queries for each entity table that could indicate that a specific > row from the actor table has been suppressed. > > I created <https://phabricator.wikimedia.org/T224850> to look for a > solution for this same class of issue in the comment table. If we find > a fix there it seems likely that we could do the same thing for the > actor table, whatever that may end up being. > > > Also, what is that "key0" about, which is not associated with any tables? > > This is a key generated by the database on a temporary table that is > being used to answer your query. It looks like there is a little > display bug in the sql-optimizer tool. The table name that should be > shown is "<derived2>" which you can see in the HTML source of the > page. See <https://mariadb.com/kb/en/library/derived-table-with-key-optimization/> > for some details on this type of internal optimization. > >

...

Finally, any advice on how to make it more efficient is highly appreciated.

> > The "Using filesort" notes in the plan are caused by the GROUP BYs. If > you can figure out how to get rid of the GROUP BY clauses things > should be faster, but to do that I think you might need to collect > information using multiple queries and some scripting language to > collect and correlate the data outside of the database. > > I would generally expect any query that tries to compute statistics > for all editors who have ever been active on a wiki will be slow. You > might try finding a good way to break the query into smaller batches. > > > [0]: https://pastebin.com/jWTzsiJY > > [1]: https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++… > > > Bryan > -- > Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> > [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA > irc: bd808 v:415.839.6885 x6855 > > _______________________________________________ > Wikimedia Cloud Services mailing list > Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) > https://lists.wikimedia.org/mailman/listinfo/cloud -- Jaime Crespo <http://wikimedia.org>

Alex Monk

9:07 a.m.

It looks to me like the query attempts to count all logged-in non-bot page creations in certain namespaces, grouped by user, and each user also has their counts for the various different individual namespaces shown. On Wed, 5 Jun 2019 at 09:53, Jaime Crespo <jcrespo(a)wikimedia.org> wrote:

...

Finally, any advice on how to make it more efficient is highly

appreciated. I would be more concerned about the derived subquery, as that can create many executions of the inside query for every row of the outside one. If you could update the query with a comprehensive comment of what you are trying to achieve, that would help optimizing it, because some times transformation requires a logically different query but one that obtains equal or similar results. On Wed, Jun 5, 2019 at 2:21 AM Bryan Davis <bd808(a)wikimedia.org> wrote:

On Tue, Jun 4, 2019 at 4:54 PM Huji Lee <huji.huji(a)gmail.com> wrote: > > Hi all, > > I have a query that used to do a GROUP BY on rev_user and now I have

Is this a side effect of the revision_actor_temp table?

tables?

https://mariadb.com/kb/en/library/derived-table-with-key-optimization/> > for some details on this type of internal optimization. > >

Finally, any advice on how to make it more efficient is highly

appreciated.

https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++…

Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855 _______________________________________________ Wikimedia Cloud Services mailing list Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

-- Jaime Crespo <http://wikimedia.org> _______________________________________________ Wikimedia Cloud Services mailing list Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

Huji Lee

1:17 p.m.

Alex is correct. The goal of the query is to generate a table like this <https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4_%D8%AF%DB%8C%D8%AA%D8%A7%D8%A8%DB%8C%D8%B3/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1%D8%A7%D9%86_%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7_%D8%A8%D8%B1_%D9%BE%D8%A7%DB%8C%D9%87_%D8%AA%D8%B9%D8%AF%D8%A7%D8%AF_%D8%A7%DB%8C%D8%AC%D8%A7%D8%AF_%D8%B5%D9%81%D8%AD%D9%87%E2%80%8C%D9%87%D8%A7> (which is in Persian, so I will explain it) where each row is one user, the first five columns show the number of pages created by that user in various namespaces (article, template, help, category, and portal) and the last columns shows the overall count of page creations (in all namespaces). So for me (Huji, row 79 of that table), the numbers indicate that I have created a total of 2180 pages of which 1421 are articles, 522 are templates, etc. On Wed, Jun 5, 2019 at 5:08 AM Alex Monk <krenair(a)gmail.com> wrote:

...

Finally, any advice on how to make it more efficient is highly

On Tue, Jun 4, 2019 at 4:54 PM Huji Lee <huji.huji(a)gmail.com> wrote: > > Hi all, > > I have a query that used to do a GROUP BY on rev_user and now I have

Is this a side effect of the revision_actor_temp table?

tables?

https://mariadb.com/kb/en/library/derived-table-with-key-optimization/> > for some details on this type of internal optimization. > >

Finally, any advice on how to make it more efficient is highly

appreciated.

https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++…

_______________________________________________ Wikimedia Cloud Services mailing list Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

1788

days inactive

1789

days old

cloud@lists.wikimedia.org

Manage subscription

5 comments

4 participants

tags (0)

participants (4)

Alex Monk
Bryan Davis
Huji Lee
Jaime Crespo