<div dir="ltr">However there is one key difference in the query vs WantedPages if you take your first example "HaRav_Moshe_Nehemia_Cohenov" is listed with almost 20k incoming links, however if you filter for only ns 0 links you will discover that it actually has zero incoming links. Not only was I attempting to only get results from ns 0 but also looking to get only ns 0 based links. (Yes I know some templates can still taint the results) but it will stop templates such as <a href="https://en.wikipedia.org/wiki/Template:WikiProject_Israel">https://en.wikipedia.org/wiki/Template:WikiProject_Israel</a> from completely making the report useless. Because of the size of most wiki projects, if they add a wanted pages section to their talk page banner the validity of WantedPages becomes significantly skewed.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Sep 23, 2017 at 1:16 PM, Jaime Crespo <span dir="ltr"><<a href="mailto:jcrespo@wikimedia.org" target="_blank">jcrespo@wikimedia.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">John,<br>
<br>
Despite your query returning only 50 results, it is unlikely to finish- ever :-)<br>
<br>
Your query is reading (almost) every single row of pagelinks (1<br>
billion rows), -which is also by itself not very optimized-, and<br>
joining it with page (35 million rows), and then sorting all results,<br>
all without using a proper index, then discarding all of them except<br>
the top 50.<br>
<br>
I would wonder if you have considered using the cached functionality<br>
found at <a href="https://en.wikipedia.org/wiki/Special:WantedPages" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/<wbr>Special:WantedPages</a> but if you<br>
absolutely have to run your query? Despite the wrong overriden<br>
mediawiki messages show there, those are updated twice a month: This<br>
is the same list as shown there on namespace 0:<br>
<br>
root@dbstore1001[enwiki]> SELECT qc_title as page, qc_value as<br>
numlinks FROM querycache WHERE qc_type =<br>
'Wantedpages' and qc_namespace=0 ORDER BY qc_value DESC LIMIT 50;<br>
+-----------------------------<wbr>---------------+----------+<br>
| page | numlinks |<br>
+-----------------------------<wbr>---------------+----------+<br>
| HaRav_Moshe_Nehemia_Cohenov | 19729 |<br>
| HaRav_Meir_Shehenboim_(Etz_<wbr>Hadar) | 19728 |<br>
| Wrestling_at_the_2010_South_<wbr>American_Games | 12659 |<br>
| Post-structuralist_ballet | 12025 |<br>
| Robert_MacDonnell | 10541 |<br>
| Charles_Michael_Lavery_QC | 10540 |<br>
| Alison_Campbell | 9747 |<br>
| Shauna_Gunn | 9744 |<br>
| Sophie_Hoopman | 9742 |<br>
| District_Policing_Partnerships | 9741 |<br>
| Stanley_Leisure | 9740 |<br>
| Robert_Hall_(doctor) | 9739 |<br>
| Hunt_Committee | 9738 |<br>
| Brian_Baird_(newsreader) | 9738 |<br>
| Choo_Chiat_Goh | 7483 |<br>
| Michael_Cook_(ballet) | 7480 |<br>
| Zoulikha_Naciri | 7127 |<br>
| Mounir_Charïbi | 7127 |<br>
| Mohamed_Moatassim | 7127 |<br>
| Khalil_Hachimi_Idrissi | 7127 |<br>
| Hassan_Aourid | 7127 |<br>
| Faycal_Laraichi | 7127 |<br>
| Campocraspedon | 6982 |<br>
| Al-Masdar | 6811 |<br>
| Mackay_Radio | 6809 |<br>
| Lebanese_literature | 6042 |<br>
| Al_Nabad | 6007 |<br>
| Badran_Roy_Badran | 6006 |<br>
| Mohammad_Abdul_Jabbar_Khairy | 6001 |<br>
| Toufic_Awwad | 5999 |<br>
| Maurice_Chehab | 5999 |<br>
| Ghazi_Aad | 5999 |<br>
| Kawasaki_Vulcan_500_LTD | 5724 |<br>
| Buddhism_in_Cuba | 5706 |<br>
| Thomas_steam_bicycle | 5698 |<br>
| Cementerio_de_Santa_Ifigenia | 5696 |<br>
| Cuban_Liberation_Army | 5695 |<br>
| Fidel_Castro_Díaz-Balart | 5694 |<br>
| Consejo_popular_(Cuba) | 5694 |<br>
| Moto_Guzzi_V7 | 5687 |<br>
| Dirt_Rider | 5676 |<br>
| Dick_O'Brien | 5673 |<br>
| Motus_Motorcycles | 5671 |<br>
| Greenhorn_Enduro | 5671 |<br>
| Werner_Motocyclette | 5670 |<br>
| Vietnam_Veterans_Motorcycle_<wbr>Club | 5670 |<br>
| Moto_Guzzi_Eldorado | 5670 |<br>
| Moto_Guzzi_Ambassador | 5670 |<br>
| Bimota_Tesi | 5670 |<br>
| Sara_Liberte | 5669 |<br>
+-----------------------------<wbr>---------------+----------+<br>
50 rows in set (0.00 sec)<br>
<br>
Last updated 11 days ago:<br>
root@dbstore1001[enwiki]> SELECT * FROM querycache_info WHERE<br>
qci_type='Wantedpages';<br>
+-------------+---------------<wbr>-+<br>
| qci_type | qci_timestamp |<br>
+-------------+---------------<wbr>-+<br>
| Wantedpages | 20170912074914 |<br>
+-------------+---------------<wbr>-+<br>
1 row in set (0.04 sec)<br>
<br>
This table, querycache, could be replicated to the wiki replicas, but<br>
we need someone to help with the sanitization (remove the unwatched<br>
pages info).<br>
<div class="HOEnZb"><div class="h5"><br>
On Sat, Sep 23, 2017 at 2:40 AM, John <<a href="mailto:phoenixoverride@gmail.com">phoenixoverride@gmail.com</a>> wrote:<br>
> The query I am using in this case is:<br>
><br>
> select pl_title, count(*) from pagelinks left join page on pl_title =<br>
> page_title where pl_namespace = 0 and pl_from_namespace = 0 and page_id<br>
> IS NULL group by pl_title having COUNT(*) > 100 LIMIT 50;<br>
><br>
> I am also running this on enwiki where the table is probably the largest. I<br>
> am attempting to locate red linked pages via number of incoming links. I<br>
> would expect it to take an hour or more to run probably, given the size of<br>
> the table. But I am getting the error message in the phab ticket: ERROR 2006<br>
> (HY000): MySQL server has gone away<br>
><br>
> On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan<br>
> <<a href="mailto:mviswanathan@wikimedia.org">mviswanathan@wikimedia.org</a>> wrote:<br>
>><br>
>> Hi John,<br>
>><br>
>> Please provide us with a bit more information so we can help - which host<br>
>> are you connecting to, what long running queries are you trying to run, etc.<br>
>> I believe you may be running into some timeout/mechanisms that we have set<br>
>> up to prevent abuse, see - <a href="https://phabricator.wikimedia.org/T76956#948591" rel="noreferrer" target="_blank">https://phabricator.wikimedia.<wbr>org/T76956#948591</a>.<br>
>><br>
>> As always, feel free to reach out to us on #wikimedia-cloud for any<br>
>> questions.<br>
>><br>
>> Thanks,<br>
>><br>
>> On Fri, Sep 22, 2017 at 1:29 PM, John <<a href="mailto:phoenixoverride@gmail.com">phoenixoverride@gmail.com</a>> wrote:<br>
>>><br>
>>> Do we have a way of letting SQL queries run for a day or so? Im using the<br>
>>> interactive sql <host> and running them that way but the DB host is going<br>
>>> away in the middle of the process.<br>
>>><br>
>>> ______________________________<wbr>_________________<br>
>>> Labs-l mailing list<br>
>>> <a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
>>> <a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
>>><br>
>><br>
>><br>
>><br>
>> --<br>
>> --Madhu :)<br>
>><br>
>> ______________________________<wbr>_________________<br>
>> Labs-l mailing list<br>
>> <a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
>> <a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
>><br>
><br>
><br>
> ______________________________<wbr>_________________<br>
> Labs-l mailing list<br>
> <a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
> <a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
><br>
<br>
<br>
<br>
--<br>
</div></div><span class="HOEnZb"><font color="#888888">Jaime Crespo<br>
<<a href="http://wikimedia.org" rel="noreferrer" target="_blank">http://wikimedia.org</a>><br>
</font></span><div class="HOEnZb"><div class="h5"><br>
______________________________<wbr>_________________<br>
Labs-l mailing list<br>
<a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>
<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>
</div></div></blockquote></div><br></div>