Hi all,
I have a query that used to do a GROUP BY on rev_user and now I have migrated it to do so using rev_actor instead. The latest version looks like this https://pastebin.com/jWTzsiJY and its performance is not good https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++actor_name%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+0+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+article%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+10+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+tpl%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+12+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+helppage%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+14+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+cat%2C%0D%0A++SUM%28%0D%0A++++CASE%0D%0A++++++WHEN+page_namespace+%3D+100+THEN+1%0D%0A++++++ELSE+0%0D%0A++++END%0D%0A++%29+portal%2C%0D%0A++COUNT%28rev_first%29+tot%0D%0AFROM+revision+r%0D%0AJOIN+%28%0D%0A++select%0D%0A++++MIN%28rev_id%29+rev_first%2C%0D%0A++++rev_page%0D%0A++FROM+revision%0D%0A++GROUP+BY+rev_page%0D%0A%29+f%0D%0A++ON+r.rev_id+%3D+f.rev_first%0D%0AJOIN+page%0D%0A++ON+page_id+%3D+r.rev_page%0D%0AJOIN+actor%0D%0A++ON+rev_actor+%3D+actor_id%0D%0ALEFT+JOIN+user_groups%0D%0A++ON+actor_user+%3D+ug_user%0D%0A++AND+ug_group+%3D+%22bot%22%0D%0AWHERE%0D%0A++actor_user+%3C%3E+0%0D%0A++AND+ug_group+IS+NULL%0D%0A++AND+page_namespace+IN+%28%0D%0A++++0%2C%0D%0A++++10%2C%0D%0A++++12%2C%0D%0A++++14%2C%0D%0A++++100%0D%0A++%29%0D%0AGROUP+BY+rev_actor%0D%0AORDER+BY+tot+desc%0D%0ALIMIT+300 at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.
Is this a side effect of the revision_actor_temp table?
Also, what is that "key0" about, which is not associated with any tables?
Finally, any advice on how to make it more efficient is highly appreciated.
Thanks! Huji
On Tue, Jun 4, 2019 at 4:54 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
I have a query that used to do a GROUP BY on rev_user and now I have migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.
Is this a side effect of the revision_actor_temp table?
The EXPLAIN output including filearchive, image, oldimage, and other apparently unrelated fields is caused by the view for the actor table. This is basically the same problem as I described in https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html for the view of the comment table. The flag for suppression lives in the related entity table, so the actor view runs correlated sub-queries for each entity table that could indicate that a specific row from the actor table has been suppressed.
I created https://phabricator.wikimedia.org/T224850 to look for a solution for this same class of issue in the comment table. If we find a fix there it seems likely that we could do the same thing for the actor table, whatever that may end up being.
Also, what is that "key0" about, which is not associated with any tables?
This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See https://mariadb.com/kb/en/library/derived-table-with-key-optimization/ for some details on this type of internal optimization.
Finally, any advice on how to make it more efficient is highly appreciated.
The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database.
I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.
Bryan
All great answers; thank you! Let's wait and see what comes out of T224850.
On Tue, Jun 4, 2019 at 8:21 PM Bryan Davis bd808@wikimedia.org wrote:
On Tue, Jun 4, 2019 at 4:54 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
I have a query that used to do a GROUP BY on rev_user and now I have
migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.
Is this a side effect of the revision_actor_temp table?
The EXPLAIN output including filearchive, image, oldimage, and other apparently unrelated fields is caused by the view for the actor table. This is basically the same problem as I described in https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html for the view of the comment table. The flag for suppression lives in the related entity table, so the actor view runs correlated sub-queries for each entity table that could indicate that a specific row from the actor table has been suppressed.
I created https://phabricator.wikimedia.org/T224850 to look for a solution for this same class of issue in the comment table. If we find a fix there it seems likely that we could do the same thing for the actor table, whatever that may end up being.
Also, what is that "key0" about, which is not associated with any tables?
This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See < https://mariadb.com/kb/en/library/derived-table-with-key-optimization/%3E for some details on this type of internal optimization.
Finally, any advice on how to make it more efficient is highly
appreciated.
The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database.
I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.
[1]:
https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++a...
Bryan
Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
Finally, any advice on how to make it more efficient is highly appreciated.
I would be more concerned about the derived subquery, as that can create many executions of the inside query for every row of the outside one.
If you could update the query with a comprehensive comment of what you are trying to achieve, that would help optimizing it, because some times transformation requires a logically different query but one that obtains equal or similar results.
On Wed, Jun 5, 2019 at 2:21 AM Bryan Davis bd808@wikimedia.org wrote:
On Tue, Jun 4, 2019 at 4:54 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
I have a query that used to do a GROUP BY on rev_user and now I have migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.
Is this a side effect of the revision_actor_temp table?
The EXPLAIN output including filearchive, image, oldimage, and other apparently unrelated fields is caused by the view for the actor table. This is basically the same problem as I described in https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html for the view of the comment table. The flag for suppression lives in the related entity table, so the actor view runs correlated sub-queries for each entity table that could indicate that a specific row from the actor table has been suppressed.
I created https://phabricator.wikimedia.org/T224850 to look for a solution for this same class of issue in the comment table. If we find a fix there it seems likely that we could do the same thing for the actor table, whatever that may end up being.
Also, what is that "key0" about, which is not associated with any tables?
This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See https://mariadb.com/kb/en/library/derived-table-with-key-optimization/ for some details on this type of internal optimization.
Finally, any advice on how to make it more efficient is highly appreciated.
The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database.
I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.
Bryan
Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
It looks to me like the query attempts to count all logged-in non-bot page creations in certain namespaces, grouped by user, and each user also has their counts for the various different individual namespaces shown.
On Wed, 5 Jun 2019 at 09:53, Jaime Crespo jcrespo@wikimedia.org wrote:
Finally, any advice on how to make it more efficient is highly
appreciated.
I would be more concerned about the derived subquery, as that can create many executions of the inside query for every row of the outside one.
If you could update the query with a comprehensive comment of what you are trying to achieve, that would help optimizing it, because some times transformation requires a logically different query but one that obtains equal or similar results.
On Wed, Jun 5, 2019 at 2:21 AM Bryan Davis bd808@wikimedia.org wrote:
On Tue, Jun 4, 2019 at 4:54 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
I have a query that used to do a GROUP BY on rev_user and now I have
migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.
Is this a side effect of the revision_actor_temp table?
The EXPLAIN output including filearchive, image, oldimage, and other apparently unrelated fields is caused by the view for the actor table. This is basically the same problem as I described in https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html for the view of the comment table. The flag for suppression lives in the related entity table, so the actor view runs correlated sub-queries for each entity table that could indicate that a specific row from the actor table has been suppressed.
I created https://phabricator.wikimedia.org/T224850 to look for a solution for this same class of issue in the comment table. If we find a fix there it seems likely that we could do the same thing for the actor table, whatever that may end up being.
Also, what is that "key0" about, which is not associated with any
tables?
This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See <
https://mariadb.com/kb/en/library/derived-table-with-key-optimization/%3E
for some details on this type of internal optimization.
Finally, any advice on how to make it more efficient is highly
appreciated.
The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database.
I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.
[1]:
https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++a...
Bryan
Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
-- Jaime Crespo http://wikimedia.org
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
Alex is correct. The goal of the query is to generate a table like this https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4_%D8%AF%DB%8C%D8%AA%D8%A7%D8%A8%DB%8C%D8%B3/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1%D8%A7%D9%86_%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7_%D8%A8%D8%B1_%D9%BE%D8%A7%DB%8C%D9%87_%D8%AA%D8%B9%D8%AF%D8%A7%D8%AF_%D8%A7%DB%8C%D8%AC%D8%A7%D8%AF_%D8%B5%D9%81%D8%AD%D9%87%E2%80%8C%D9%87%D8%A7 (which is in Persian, so I will explain it) where each row is one user, the first five columns show the number of pages created by that user in various namespaces (article, template, help, category, and portal) and the last columns shows the overall count of page creations (in all namespaces).
So for me (Huji, row 79 of that table), the numbers indicate that I have created a total of 2180 pages of which 1421 are articles, 522 are templates, etc.
On Wed, Jun 5, 2019 at 5:08 AM Alex Monk krenair@gmail.com wrote:
It looks to me like the query attempts to count all logged-in non-bot page creations in certain namespaces, grouped by user, and each user also has their counts for the various different individual namespaces shown.
On Wed, 5 Jun 2019 at 09:53, Jaime Crespo jcrespo@wikimedia.org wrote:
Finally, any advice on how to make it more efficient is highly
appreciated.
I would be more concerned about the derived subquery, as that can create many executions of the inside query for every row of the outside one.
If you could update the query with a comprehensive comment of what you are trying to achieve, that would help optimizing it, because some times transformation requires a logically different query but one that obtains equal or similar results.
On Wed, Jun 5, 2019 at 2:21 AM Bryan Davis bd808@wikimedia.org wrote:
On Tue, Jun 4, 2019 at 4:54 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
I have a query that used to do a GROUP BY on rev_user and now I have
migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.
Is this a side effect of the revision_actor_temp table?
The EXPLAIN output including filearchive, image, oldimage, and other apparently unrelated fields is caused by the view for the actor table. This is basically the same problem as I described in https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html for the view of the comment table. The flag for suppression lives in the related entity table, so the actor view runs correlated sub-queries for each entity table that could indicate that a specific row from the actor table has been suppressed.
I created https://phabricator.wikimedia.org/T224850 to look for a solution for this same class of issue in the comment table. If we find a fix there it seems likely that we could do the same thing for the actor table, whatever that may end up being.
Also, what is that "key0" about, which is not associated with any
tables?
This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See <
https://mariadb.com/kb/en/library/derived-table-with-key-optimization/%3E
for some details on this type of internal optimization.
Finally, any advice on how to make it more efficient is highly
appreciated.
The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database.
I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.
[1]:
https://tools.wmflabs.org/sql-optimizer?use=fawiki_p&sql=SELECT%0D%0A++a...
Bryan
Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
-- Jaime Crespo http://wikimedia.org
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud