[Labs-l] Query optimization
Huji Lee
huji.huji at gmail.com
Thu Mar 10 04:04:03 UTC 2016
@John: I don't know of a way, through SQL, to query the cached contains of
that special page. Is there a way, or did you mean I should copy data from
that special page into a temp table and join from there?
Also, I am afraid most of the top 500 do have counterparts in FA WP. It is
the lower ranks that we are interested in; I guess they fall in top
2000-3000.
@Jaime: The query which I claimed was slow actually never finished on
Quarry; it got killed. The solution you propose (divide and concur) might
be my only option on Quarry. But I was hoping that wouldn't be the case.
I guess I would seize this opportunity to ask two questions:
1) For people like me who don't have SELECT access to the original tables
(what we query is really VIEWs based on the true tables, right?) and
therefore cannot see what fields are indexed, is there a place were we can
refer to to see the details of indexes available to us for each table?
2) Even a simple query like below takes forever to run. Assuming that
ll_from is indexed (which I think is the case), what makes it so expensive?
Shouldn't a simple join on indexed fields + a simple aggregation on an
indexed field be relatively cheap?
page_title, count(cl_to) as CNT from page join categorylinks on page_title
= cl_to where page_namespace = 14 group by cl_to;
If anyone reading this has the ability to run EXPLAIN on the above query
and share the results, that would be fantastic.
On Wed, Mar 9, 2016 at 3:17 AM, Jaime Crespo <jcrespo at wikimedia.org> wrote:
> I would recommend starting with the appropriate Special page:
> [[Special:MostLinkedCategories]]
> That generates the 500 categories with most members and is refreshed
> every few days.
>
> If that wasn't enough, I would do:
>
> SELECT max(cat_id) FROM category;
>
> to get the last cat_id, and then, in small, manageable transactions, run:
>
> SELECT page_title, cat_pages, cat_subcats, page_id
> FROM category JOIN page ON category.cat_title = page.page_title AND
> page.page_namespace = 14
> LEFT JOIN langlinks ON page_id = ll_from and ll_lang = 'fa'
> WHERE cat_id BETWEEN ? and ? and cat_pages > 500 and ll_from is null;
>
>
> where the first 2 "?" are small ranges to cover, in small increments
> (e.g. 10000 records each time), between 1 and max(cat_id). Change
> "cat_pages > 500" to whatever fits you.
>
> Double check the where/on s, not 100% sure they are right, but it
> should be close to what you want.
>
> On Wed, Mar 9, 2016 at 4:38 AM, John <phoenixoverride at gmail.com> wrote:
> > I don't have it handy but category membership counts are stored in the db
> > use that instead of a count()
> >
> >
> > On Tuesday, March 8, 2016, Huji Lee <huji.huji at gmail.com> wrote:
> >>
> >> I am trying to find a list of those categories on EN WP that are highly
> >> used (more than 100 subpages or subcats) but don't have a counterpart
> in FA
> >> WP. The query I used is shown below and also on
> >> https://quarry.wmflabs.org/query/7943 and is extremely slow.
> >>
> >> Any thoughts on how to make it more efficient?
> >>
> >> select page_title, ll_title, count(cl_to) as CNT
> >> from page
> >> left join langlinks
> >> on ll_from = page_id
> >> and ll_lang = 'fa'
> >> join categorylinks
> >> on page_title = cl_to
> >> where page_namespace = 14
> >> and ll_from is null
> >> group by cl_to, ll_title
> >> having CNT > 100
> >> order by CNT desc;
> >>
> >>
> >
> > _______________________________________________
> > Labs-l mailing list
> > Labs-l at lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/labs-l
> >
>
>
>
> --
> Jaime Crespo
> <http://wikimedia.org>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.wikimedia.org/pipermail/labs-l/attachments/20160309/a3af4a62/attachment-0001.html>
More information about the Labs-l
mailing list