On Tue, Jun 4, 2019 at 4:54 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
I have a query that used to do a GROUP BY on rev_user and now I have migrated it to do so using rev_actor instead. The latest version looks like this [0] and its performance is not good [1] at all. In fact, I left it running for fawiki for several hours and it did not generate any results (it used to generate results in about 30-40 minutes). Also, the execution plan that is shown in the second link involves some tables like filearchive, image and oldimage that are not part of the query.
Is this a side effect of the revision_actor_temp table?
The EXPLAIN output including filearchive, image, oldimage, and other apparently unrelated fields is caused by the view for the actor table. This is basically the same problem as I described in https://lists.wikimedia.org/pipermail/cloud/2019-June/000683.html for the view of the comment table. The flag for suppression lives in the related entity table, so the actor view runs correlated sub-queries for each entity table that could indicate that a specific row from the actor table has been suppressed.
I created https://phabricator.wikimedia.org/T224850 to look for a solution for this same class of issue in the comment table. If we find a fix there it seems likely that we could do the same thing for the actor table, whatever that may end up being.
Also, what is that "key0" about, which is not associated with any tables?
This is a key generated by the database on a temporary table that is being used to answer your query. It looks like there is a little display bug in the sql-optimizer tool. The table name that should be shown is "<derived2>" which you can see in the HTML source of the page. See https://mariadb.com/kb/en/library/derived-table-with-key-optimization/ for some details on this type of internal optimization.
Finally, any advice on how to make it more efficient is highly appreciated.
The "Using filesort" notes in the plan are caused by the GROUP BYs. If you can figure out how to get rid of the GROUP BY clauses things should be faster, but to do that I think you might need to collect information using multiple queries and some scripting language to collect and correlate the data outside of the database.
I would generally expect any query that tries to compute statistics for all editors who have ever been active on a wiki will be slow. You might try finding a good way to break the query into smaller batches.
Bryan