<div dir="ltr">However there is one key difference in the query vs WantedPages if you take your first example "HaRav_Moshe_Nehemia_Cohenov" is listed with almost 20k incoming links, however if you filter for only ns 0 links you will discover that it actually has zero incoming links. Not only was I attempting to only get results from ns 0 but also looking to get only ns 0 based links. (Yes I know some templates can still taint the results) but it will stop templates such as <a href="https://en.wikipedia.org/wiki/Template:WikiProject_Israel">https://en.wikipedia.org/wiki/Template:WikiProject_Israel</a> from completely making the report useless. Because of the size of most wiki projects, if they add a wanted pages section to their talk page banner the validity of WantedPages becomes significantly skewed.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Sep 23, 2017 at 1:16 PM, Jaime Crespo <span dir="ltr"><<a href="mailto:jcrespo@wikimedia.org" target="_blank">jcrespo@wikimedia.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">John,<br>
<br>
Despite your query returning only 50 results, it is unlikely to finish- ever :-)<br>
<br>
Your query is reading (almost) every single row of pagelinks (1<br>
billion rows), -which is also by itself not very optimized-, and<br>
joining it with page (35 million rows), and then sorting all results,<br>
all without using a proper index, then discarding all of them except<br>
the top 50.<br>
<br>
I would wonder if you have considered using the cached functionality<br>
found at <a href="https://en.wikipedia.org/wiki/Special:WantedPages" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/<wbr>Special:WantedPages</a> but if you<br>
absolutely have to run your query? Despite the wrong overriden<br>
mediawiki messages show there, those are updated twice a month: This<br>
is the same list as shown there on namespace 0:<br>
<br>
root@dbstore1001[enwiki]> SELECT qc_title as page, qc_value as<br>
numlinks  FROM querycache WHERE qc_type =<br>
'Wantedpages' and qc_namespace=0 ORDER BY qc_value DESC LIMIT 50;<br>
+-----------------------------<wbr>---------------+----------+<br>
| page                                       | numlinks |<br>
+-----------------------------<wbr>---------------+----------+<br>
| HaRav_Moshe_Nehemia_Cohenov                |    19729 |<br>
| HaRav_Meir_Shehenboim_(Etz_<wbr>Hadar)          |    19728 |<br>
| Wrestling_at_the_2010_South_<wbr>American_Games |    12659 |<br>
| Post-structuralist_ballet                  |    12025 |<br>
| Robert_MacDonnell                          |    10541 |<br>
| Charles_Michael_Lavery_QC                  |    10540 |<br>
| Alison_Campbell                            |     9747 |<br>
| Shauna_Gunn                                |     9744 |<br>
| Sophie_Hoopman                             |     9742 |<br>
| District_Policing_Partnerships             |     9741 |<br>
| Stanley_Leisure                            |     9740 |<br>
| Robert_Hall_(doctor)                       |     9739 |<br>
| Hunt_Committee                             |     9738 |<br>
| Brian_Baird_(newsreader)                   |     9738 |<br>
| Choo_Chiat_Goh                             |     7483 |<br>
| Michael_Cook_(ballet)                      |     7480 |<br>
| Zoulikha_Naciri                            |     7127 |<br>
| Mounir_Charïbi                             |     7127 |<br>
| Mohamed_Moatassim                          |     7127 |<br>
| Khalil_Hachimi_Idrissi                     |     7127 |<br>
| Hassan_Aourid                              |     7127 |<br>
| Faycal_Laraichi                            |     7127 |<br>
| Campocraspedon                             |     6982 |<br>
| Al-Masdar                                  |     6811 |<br>
| Mackay_Radio                               |     6809 |<br>
| Lebanese_literature                        |     6042 |<br>
| Al_Nabad                                   |     6007 |<br>
| Badran_Roy_Badran                          |     6006 |<br>
| Mohammad_Abdul_Jabbar_Khairy               |     6001 |<br>
| Toufic_Awwad                               |     5999 |<br>
| Maurice_Chehab                             |     5999 |<br>
| Ghazi_Aad                                  |     5999 |<br>
| Kawasaki_Vulcan_500_LTD                    |     5724 |<br>
| Buddhism_in_Cuba                           |     5706 |<br>
| Thomas_steam_bicycle                       |     5698 |<br>
| Cementerio_de_Santa_Ifigenia               |     5696 |<br>
| Cuban_Liberation_Army                      |     5695 |<br>
| Fidel_Castro_Díaz-Balart                   |     5694 |<br>
| Consejo_popular_(Cuba)                     |     5694 |<br>
| Moto_Guzzi_V7                              |     5687 |<br>
| Dirt_Rider                                 |     5676 |<br>
| Dick_O'Brien                               |     5673 |<br>
| Motus_Motorcycles                          |     5671 |<br>
| Greenhorn_Enduro                           |     5671 |<br>
| Werner_Motocyclette                        |     5670 |<br>
| Vietnam_Veterans_Motorcycle_<wbr>Club           |     5670 |<br>
| Moto_Guzzi_Eldorado                        |     5670 |<br>
| Moto_Guzzi_Ambassador                      |     5670 |<br>
| Bimota_Tesi                                |     5670 |<br>
| Sara_Liberte                               |     5669 |<br>
+-----------------------------<wbr>---------------+----------+<br>
50 rows in set (0.00 sec)<br>
<br>
Last updated 11 days ago:<br>
root@dbstore1001[enwiki]> SELECT * FROM querycache_info WHERE<br>
qci_type='Wantedpages';<br>
+-------------+---------------<wbr>-+<br>
| qci_type    | qci_timestamp  |<br>
+-------------+---------------<wbr>-+<br>
| Wantedpages | 20170912074914 |<br>
+-------------+---------------<wbr>-+<br>
1 row in set (0.04 sec)<br>
<br>
This table, querycache, could be replicated to the wiki replicas, but<br>
we need someone to help with the sanitization (remove the unwatched<br>
pages info).<br>
<div class="HOEnZb"><div class="h5"><br>
On Sat, Sep 23, 2017 at 2:40 AM, John <<a href="mailto:phoenixoverride@gmail.com">phoenixoverride@gmail.com</a>> wrote:<br>
> The query I am using in this case is:<br>
><br>
> select pl_title, count(*) from pagelinks  left join page on pl_title =<br>
> page_title  where pl_namespace = 0  and pl_from_namespace  = 0 and page_id<br>
> IS NULL  group by pl_title having COUNT(*) > 100 LIMIT 50;<br>
><br>
> I am also running this on enwiki where the table is probably the largest. I<br>
> am attempting to locate red linked pages via number of incoming links. I<br>
> would expect it to take an hour or more to run probably, given the size of<br>
> the table. But I am getting the error message in the phab ticket: ERROR 2006<br>
> (HY000): MySQL server has gone away<br>
><br>
> On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan<br>
> <<a href="mailto:mviswanathan@wikimedia.org">mviswanathan@wikimedia.org</a>> wrote:<br>
>><br>
>> Hi John,<br>
>><br>
>> Please provide us with a bit more information so we can help - which host<br>
>> are you connecting to, what long running queries are you trying to run, etc.<br>
>> I believe you may be running into some timeout/mechanisms that we have set<br>
>> up to prevent abuse, see - <a href="https://phabricator.wikimedia.org/T76956#948591" rel="noreferrer" target="_blank">https://phabricator.wikimedia.<wbr>org/T76956#948591</a>.<br>
>><br>
>> As always, feel free to reach out to us on #wikimedia-cloud for any<br>
>> questions.<br>
>><br>
>> Thanks,<br>
>><br>
>> On Fri, Sep 22, 2017 at 1:29 PM, John <<a href="mailto:phoenixoverride@gmail.com">phoenixoverride@gmail.com</a>> wrote:<br>
>>><br>
>>> Do we have a way of letting SQL queries run for a day or so? Im using the<br>
>>> interactive sql <host> and running them that way but the DB host is going<br>
>>> away in the middle of the process.<br>
>>><br>
>>> ______________________________<wbr>_________________<br>
>>> Labs-l mailing list<br>
>>> <a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
>>> <a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
>>><br>
>><br>
>><br>
>><br>
>> --<br>
>> --Madhu :)<br>
>><br>
>> ______________________________<wbr>_________________<br>
>> Labs-l mailing list<br>
>> <a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
>> <a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
>><br>
><br>
><br>
> ______________________________<wbr>_________________<br>
> Labs-l mailing list<br>
> <a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
> <a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
><br>
<br>
<br>
<br>
--<br>
</div></div><span class="HOEnZb"><font color="#888888">Jaime Crespo<br>
<<a href="http://wikimedia.org" rel="noreferrer" target="_blank">http://wikimedia.org</a>><br>
</font></span><div class="HOEnZb"><div class="h5"><br>
______________________________<wbr>_________________<br>
Labs-l mailing list<br>
<a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
</div></div></blockquote></div><br></div>