[Labs-l] long running queries
Jaime Crespo
jcrespo at wikimedia.org
Sat Sep 23 17:50:05 UTC 2017
Sure, I was assuming that at least 50 of the links-from-ns-0 titles would
be on the 5000-entry list (I haven't checked, though), which would make
the filtering much faster.
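
Something along these lines is what I had in mind; an untested sketch
(treat it as pseudocode rather than a ready-to-run query): instead of
scanning all of pagelinks, only count the ns-0-to-ns-0 links of the (at
most) 5000 titles already cached for WantedPages:

SELECT qc_title AS page, COUNT(*) AS ns0_links
FROM querycache                 -- cached WantedPages rows (max ~5000)
JOIN pagelinks
  ON pl_namespace = 0           -- link target is in namespace 0
 AND pl_title = qc_title
 AND pl_from_namespace = 0      -- link source is in namespace 0
WHERE qc_type = 'Wantedpages'
  AND qc_namespace = 0
GROUP BY qc_title
ORDER BY ns0_links DESC
LIMIT 50;

That only touches the link rows for the cached titles, so it should
finish in seconds rather than hours.
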
Maybe you or someone else could come up with a MediaWiki patch for a new
special page?
On Sat, Sep 23, 2017 at 7:30 PM, John <phoenixoverride at gmail.com> wrote:
> However, there is one key difference between my query and WantedPages. If
> you take your first example, "HaRav_Moshe_Nehemia_Cohenov" is listed with
> almost 20k incoming links, but if you filter for only ns 0 links you will
> discover that it actually has zero incoming links. I was not only trying
> to get results from ns 0, but also to count only links that originate from
> ns 0. (Yes, I know some templates can still taint the results.) That filter
> will at least stop templates such as
> https://en.wikipedia.org/wiki/Template:WikiProject_Israel from making the
> report completely useless: given the size of most wiki projects, if they
> add a wanted-pages section to their talk page banner, the WantedPages
> results become significantly skewed.
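>
> For example, a quick check along these lines (untested, just to
> illustrate the point) should confirm that it has no ns-0-to-ns-0
> incoming links at all:
>
> SELECT COUNT(*) AS ns0_links
> FROM pagelinks
> WHERE pl_namespace = 0                          -- link target in ns 0
>   AND pl_title = 'HaRav_Moshe_Nehemia_Cohenov'
>   AND pl_from_namespace = 0;                    -- link source in ns 0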
>
> On Sat, Sep 23, 2017 at 1:16 PM, Jaime Crespo <jcrespo at wikimedia.org> wrote:
>>
>> John,
>>
>> Even though your query returns only 50 results, it is unlikely ever to
>> finish :-)
>>
>> Your query reads (almost) every single row of pagelinks (1 billion
>> rows), which is by itself not very well optimized, joins it with page
>> (35 million rows), and then sorts all of the results without using a
>> proper index, only to discard everything except the top 50.
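>>
>> You can see this for yourself by prefixing the statement with EXPLAIN on
>> the replica (EXPLAIN only shows the execution plan, it does not run the
>> query); the row estimates alone should make the problem obvious:
>>
>> EXPLAIN SELECT pl_title, COUNT(*)
>> FROM pagelinks
>> LEFT JOIN page ON pl_title = page_title
>> WHERE pl_namespace = 0 AND pl_from_namespace = 0 AND page_id IS NULL
>> GROUP BY pl_title
>> HAVING COUNT(*) > 100
>> LIMIT 50;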
>>
>> Have you considered using the cached functionality at
>> https://en.wikipedia.org/wiki/Special:WantedPages instead, or do you
>> absolutely have to run your own query? Despite the incorrect, overridden
>> MediaWiki message shown there, those results are updated twice a month.
>> This is the same list as shown there, restricted to namespace 0:
>>
>> root at dbstore1001[enwiki]> SELECT qc_title AS page, qc_value AS numlinks
>>     FROM querycache
>>     WHERE qc_type = 'Wantedpages' AND qc_namespace = 0
>>     ORDER BY qc_value DESC LIMIT 50;
>> +--------------------------------------------+----------+
>> | page | numlinks |
>> +--------------------------------------------+----------+
>> | HaRav_Moshe_Nehemia_Cohenov | 19729 |
>> | HaRav_Meir_Shehenboim_(Etz_Hadar) | 19728 |
>> | Wrestling_at_the_2010_South_American_Games | 12659 |
>> | Post-structuralist_ballet | 12025 |
>> | Robert_MacDonnell | 10541 |
>> | Charles_Michael_Lavery_QC | 10540 |
>> | Alison_Campbell | 9747 |
>> | Shauna_Gunn | 9744 |
>> | Sophie_Hoopman | 9742 |
>> | District_Policing_Partnerships | 9741 |
>> | Stanley_Leisure | 9740 |
>> | Robert_Hall_(doctor) | 9739 |
>> | Hunt_Committee | 9738 |
>> | Brian_Baird_(newsreader) | 9738 |
>> | Choo_Chiat_Goh | 7483 |
>> | Michael_Cook_(ballet) | 7480 |
>> | Zoulikha_Naciri | 7127 |
>> | Mounir_Charïbi | 7127 |
>> | Mohamed_Moatassim | 7127 |
>> | Khalil_Hachimi_Idrissi | 7127 |
>> | Hassan_Aourid | 7127 |
>> | Faycal_Laraichi | 7127 |
>> | Campocraspedon | 6982 |
>> | Al-Masdar | 6811 |
>> | Mackay_Radio | 6809 |
>> | Lebanese_literature | 6042 |
>> | Al_Nabad | 6007 |
>> | Badran_Roy_Badran | 6006 |
>> | Mohammad_Abdul_Jabbar_Khairy | 6001 |
>> | Toufic_Awwad | 5999 |
>> | Maurice_Chehab | 5999 |
>> | Ghazi_Aad | 5999 |
>> | Kawasaki_Vulcan_500_LTD | 5724 |
>> | Buddhism_in_Cuba | 5706 |
>> | Thomas_steam_bicycle | 5698 |
>> | Cementerio_de_Santa_Ifigenia | 5696 |
>> | Cuban_Liberation_Army | 5695 |
>> | Fidel_Castro_Díaz-Balart | 5694 |
>> | Consejo_popular_(Cuba) | 5694 |
>> | Moto_Guzzi_V7 | 5687 |
>> | Dirt_Rider | 5676 |
>> | Dick_O'Brien | 5673 |
>> | Motus_Motorcycles | 5671 |
>> | Greenhorn_Enduro | 5671 |
>> | Werner_Motocyclette | 5670 |
>> | Vietnam_Veterans_Motorcycle_Club | 5670 |
>> | Moto_Guzzi_Eldorado | 5670 |
>> | Moto_Guzzi_Ambassador | 5670 |
>> | Bimota_Tesi | 5670 |
>> | Sara_Liberte | 5669 |
>> +--------------------------------------------+----------+
>> 50 rows in set (0.00 sec)
>>
>> Last updated 11 days ago:
>> root at dbstore1001[enwiki]> SELECT * FROM querycache_info WHERE
>> qci_type='Wantedpages';
>> +-------------+----------------+
>> | qci_type | qci_timestamp |
>> +-------------+----------------+
>> | Wantedpages | 20170912074914 |
>> +-------------+----------------+
>> 1 row in set (0.04 sec)
>>
>> This table, querycache, could be replicated to the wiki replicas, but
>> we need someone to help with the sanitization (remove the unwatched
>> pages info).
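>>
>> For what it's worth, the filtering itself would probably be simple; a
>> hypothetical sketch, assuming the sensitive rows are the ones with
>> qc_type = 'Unwatchedpages' and that we would expose the table the same
>> way as the other sanitized views:
>>
>> -- created in the enwiki_p database, like the other wiki replica views
>> CREATE OR REPLACE VIEW querycache AS
>>   SELECT qc_type, qc_namespace, qc_title, qc_value
>>   FROM enwiki.querycache
>>   WHERE qc_type <> 'Unwatchedpages';   -- hide unwatched-pages rows
>>
>> The main work is someone checking that nothing else in that table is
>> sensitive.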
>>
>> On Sat, Sep 23, 2017 at 2:40 AM, John <phoenixoverride at gmail.com> wrote:
>> > The query I am using in this case is:
>> >
>> > SELECT pl_title, COUNT(*)
>> > FROM pagelinks
>> > LEFT JOIN page ON pl_title = page_title
>> > WHERE pl_namespace = 0 AND pl_from_namespace = 0 AND page_id IS NULL
>> > GROUP BY pl_title
>> > HAVING COUNT(*) > 100
>> > LIMIT 50;
>> >
>> > I am running this on enwiki, where the table is probably the largest. I
>> > am attempting to locate red-linked pages by their number of incoming
>> > links. Given the size of the table, I would expect it to take an hour
>> > or more to run, but instead I get the error message from the phab
>> > ticket: ERROR 2006 (HY000): MySQL server has gone away
>> >
>> > On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan
>> > <mviswanathan at wikimedia.org> wrote:
>> >>
>> >> Hi John,
>> >>
>> >> Please provide us with a bit more information so we can help: which
>> >> host are you connecting to, what long-running queries are you trying
>> >> to run, etc. I believe you may be running into one of the timeout
>> >> mechanisms we have set up to prevent abuse; see
>> >> https://phabricator.wikimedia.org/T76956#948591.
>> >>
>> >> As always, feel free to reach out to us on #wikimedia-cloud for any
>> >> questions.
>> >>
>> >> Thanks,
>> >>
>> >> On Fri, Sep 22, 2017 at 1:29 PM, John <phoenixoverride at gmail.com>
>> >> wrote:
>> >>>
>> >>> Do we have a way of letting SQL queries run for a day or so? I'm
>> >>> using the interactive sql <host> tool and running them that way, but
>> >>> the DB host is going away in the middle of the process.
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> --Madhu :)
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Jaime Crespo
>> <http://wikimedia.org>
>>
>
>
>
--
Jaime Crespo
<http://wikimedia.org>