[Labs-l] Query optimization

Purodha Blissenbach purodha at blissenbach.org
Thu Mar 10 08:24:54 UTC 2016


I do not know the details of how replication works, but I have
assumed, and still assume, that it replicates indexes. If so, the
online manual starting at
https://www.mediawiki.org/wiki/Manual:Database_layout
may help you.

Purodha

On 10.03.2016 05:04, Huji Lee wrote:
> @John: I don't know of a way, through SQL, to query the cached
> contents of that special page. Is there a way, or did you mean I
> should copy data from that special page into a temp table and join
> from there?
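>
> (A minimal sketch of the temp-table route, assuming access to a user
> database on the replica server; "u1234__data" and "big_cats" are
> placeholder names:)
>
>  -- Create a table in your user database and load the special
>  -- page's rows into it, then join against the replica views:
>  CREATE TABLE u1234__data.big_cats (
>    cat_title VARBINARY(255) NOT NULL PRIMARY KEY
>  );
>  -- ...populate big_cats from [[Special:MostLinkedCategories]]...
>  SELECT bc.cat_title
>  FROM u1234__data.big_cats bc
>  LEFT JOIN page ON page_namespace = 14 AND page_title = bc.cat_title
>  LEFT JOIN langlinks ON ll_from = page_id AND ll_lang = 'fa'
>  WHERE ll_from IS NULL;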
>
> Also, I am afraid most of the top 500 do have counterparts in FA WP.
> It is the lower ranks that we are interested in; I guess they fall
> in the top 2000-3000.
>
> @Jaime: The query which I claimed was slow actually never finished on
> Quarry; it got killed. The solution you propose (divide and conquer)
> might be my only option on Quarry, but I was hoping that wouldn't be
> the case.
>
> I guess I will seize this opportunity to ask two questions:
>
> 1) For people like me who don't have SELECT access to the original
> tables (what we query is really VIEWs based on the true tables,
> right?) and therefore cannot see what fields are indexed, is there a
> place we can refer to in order to see the details of the indexes
> available to us for each table?
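>
> (For what it's worth, the index definitions live in MediaWiki's
> maintenance/tables.sql; the deployed schema can differ, so treat
> this excerpt as a sketch of the indexes most relevant here:)
>
>  -- page: title lookups go through (namespace, title)
>  --   UNIQUE INDEX name_title (page_namespace, page_title)
>  -- categorylinks: cl_to is the leading column of two indexes
>  --   KEY cl_timestamp (cl_to, cl_timestamp)
>  --   KEY cl_sortkey (cl_to, cl_type, cl_sortkey, cl_from)
>  -- langlinks: ll_from leads the unique key
>  --   UNIQUE KEY ll_from (ll_from, ll_lang)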
>
> 2) Even a simple query like the one below takes forever to run.
> Assuming that ll_from is indexed (which I think is the case), what
> makes it so expensive? Shouldn't a simple join on indexed fields plus
> a simple aggregation on an indexed field be relatively cheap?
>
> select page_title, count(cl_to) as CNT
> from page
> join categorylinks on page_title = cl_to
> where page_namespace = 14
> group by cl_to;
>
> If anyone reading this has the ability to run EXPLAIN on the above
> query and share the results, that would be fantastic.
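>
> (For reference, the statement would be something like the following;
> this assumes EXPLAIN is permitted on the replica views, which I have
> not verified. My guess at the cause: even though each lookup is
> indexed, the GROUP BY still has to touch every matching row of
> categorylinks before anything is returned, which on EN WP is
> presumably tens of millions of rows.)
>
>  explain select page_title, count(cl_to) as CNT
>  from page
>  join categorylinks on page_title = cl_to
>  where page_namespace = 14
>  group by cl_to;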
>
> On Wed, Mar 9, 2016 at 3:17 AM, Jaime Crespo wrote:
>
>> I would recommend starting with the appropriate special page:
>> [[Special:MostLinkedCategories]]
>> That generates the 500 categories with the most members and is
>> refreshed every few days.
>>
>> If that wasn't enough, I would do:
>>
>>  SELECT max(cat_id) FROM category;
>>
>> to get the last cat_id, and then, in small, manageable
>> transactions, run:
>>
>>  SELECT page_title, cat_pages, cat_subcats, page_id
>>  FROM category
>>  JOIN page ON category.cat_title = page.page_title
>>    AND page.page_namespace = 14
>>  LEFT JOIN langlinks ON page_id = ll_from AND ll_lang = 'fa'
>>  WHERE cat_id BETWEEN ? AND ? AND cat_pages > 500
>>    AND ll_from IS NULL;
>>
>> where the two "?" placeholders define small ranges to cover, in
>> small increments (e.g. 10000 records each time), between 1 and
>> max(cat_id). Change "cat_pages > 500" to whatever threshold fits
>> your needs.
>>
>> Double-check the WHERE/ON clauses; I am not 100% sure they are
>> right, but they should be close to what you want.
>>
>> On Wed, Mar 9, 2016 at 4:38 AM, John wrote:
>> > I don't have it handy, but category membership counts are stored
>> > in the db; use those instead of a count().
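>> >
>> > (A minimal sketch of that approach, reading the stored counts
>> > from the category table; the thresholds are just examples:)
>> >
>> >  SELECT cat_title, cat_pages, cat_subcats
>> >  FROM category
>> >  WHERE cat_pages > 100 OR cat_subcats > 100;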
>> >
>> >
>> > On Tuesday, March 8, 2016, Huji Lee wrote:
>> >>
>> >> I am trying to find a list of those categories on EN WP that are
>> >> highly used (more than 100 subpages or subcats) but don't have a
>> >> counterpart in FA WP. The query I used is shown below and also on
>> >> https://quarry.wmflabs.org/query/7943 [3] and is extremely slow.
>> >>
>> >> Any thoughts on how to make it more efficient?
>> >>
>> >> select page_title, ll_title, count(cl_to) as CNT
>> >> from page
>> >> left join langlinks
>> >> on ll_from = page_id
>> >> and ll_lang = 'fa'
>> >> join categorylinks
>> >> on page_title = cl_to
>> >> where page_namespace = 14
>> >> and ll_from is null
>> >> group by cl_to, ll_title
>> >> having CNT > 100
>> >> order by CNT desc;
>> >>
>> >>
>> >
>>
>> --
>> Jaime Crespo
>>
>
>
>
> Links:
> ------
> [1] mailto:phoenixoverride at gmail.com
> [2] mailto:huji.huji at gmail.com
> [3] https://quarry.wmflabs.org/query/7943
> [4] mailto:Labs-l at lists.wikimedia.org
> [5] https://lists.wikimedia.org/mailman/listinfo/labs-l
> [6] http://wikimedia.org
> [7] mailto:Labs-l at lists.wikimedia.org
> [8] https://lists.wikimedia.org/mailman/listinfo/labs-l
> [9] mailto:jcrespo at wikimedia.org



