I have a query that tries to find pages in article namespace that contain a link to some page in the user/user talk namespace. Many of these result from users signing in the articles (which is obviously not appropriate). The query also tries to exclude pages that have a legitimate link to user pages; e.g. our {{under construction}} template contains a link to the user page of the person who affixed the template, and we can easily find these pages based on a category they get added to by the same template.
The query is very brief, and I have pasted it below. To the best of my understand, it uses indexes and does not cause any USING WHERE or USING FILESORT steps. Nevertheless, the query takes more than 30 minutes on Labs and is killed as a result.
Is there a way to optimize this query further? If not, that is okay; I will implement a two-step query. But I just want to make sure I'm not missing something obvious.
Thanks, Huji
SELECT page_title, pl_title, CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction FROM page JOIN pagelinks ON page_id = pl_from LEFT JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت' WHERE page_namespace = 0 AND pl_namespace IN (2, 3)
Sometimes a subquery will be faster that the LEFT JOIN:
... WHERE page_id NOT IN (SELECT c2.cl_from FROM categorylinks c2 WHERE c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت' ...
If that doesn't help, you'll probably have to paginate the query (split it into smaller chunks, then recombine them).
On Sun, Apr 12, 2020 at 08:52 Huji Lee huji.huji@gmail.com wrote:
I have a query that tries to find pages in article namespace that contain a link to some page in the user/user talk namespace. Many of these result from users signing in the articles (which is obviously not appropriate). The query also tries to exclude pages that have a legitimate link to user pages; e.g. our {{under construction}} template contains a link to the user page of the person who affixed the template, and we can easily find these pages based on a category they get added to by the same template.
The query is very brief, and I have pasted it below. To the best of my understand, it uses indexes and does not cause any USING WHERE or USING FILESORT steps. Nevertheless, the query takes more than 30 minutes on Labs and is killed as a result.
Is there a way to optimize this query further? If not, that is okay; I will implement a two-step query. But I just want to make sure I'm not missing something obvious.
Thanks, Huji
SELECT page_title, pl_title, CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction FROM page JOIN pagelinks ON page_id = pl_from LEFT JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت' WHERE page_namespace = 0 AND pl_namespace IN (2, 3) _______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud