Forwarding to Analytics, Research, and Wikimetrics in case this is of interest to people who aren't subscribed to the Labs mailing list.
Pine
---------- Forwarded message ---------- From: Bryan Davis bd808@wikimedia.org Date: Tue, Dec 6, 2016 at 9:28 AM Subject: [Labs-l] Tell us about the SQL that you can't get to work To: labs-l labs-l@lists.wikimedia.org
In early January there is going to be a Developer Summit in San Francisco [0]. Chase and I are in charge of scheduling talks on the topic "Building on Wikimedia services: APIs and Developer Resources". One of the more interesting to me talks that has been proposed for this is "Labsdbs for WMF tools and contributors: get more data, faster" by Jamie Crespo [1].
I know that most of you won't be able to attend in person, but if we can show that there is enough interest in this topic we can get the talk scheduled in a main room and recorded so anyone can watch it later.
An idea I just had for showing interest is to get Tool Labs maintainers and other Labs users to describe questions that they have tried and failed to answer using SQL queries. We can look at the kinds of questions that come up and ask Jamie (and others) if there are some general recommendations that can be made about how to improve performance or understand how the bits and pieces of our data model fit together.
To kick things off, here's an example I tried to help with over the weekend. A Quarry user was adapting a query they had used before to find non-redirect File namespace pages not paired with binary files on Commons. The query they had come up with was:
SELECT DISTINCT page_title, img_name FROM ( SELECT DISTINCT page_title FROM page WHERE page_namespace = 6 AND page_is_redirect = 0 ) AS page LEFT JOIN ( SELECT DISTINCT img_name FROM image ) AS image ON page_title=img_name WHERE img_name IS NULL;
The performance of this is horrible for several reasons including the excessive use of DISTINCT. The query was consistently killed by the 30 minute runtime limit. MaxSem and I both came up with about the same optimization that eliminated the sub-queries and use of DISTINCT:
SELECT page_title, img_name FROM page LEFT OUTER JOIN image ON page_title=img_name WHERE page_namespace = 6 AND page_is_redirect = 0 AND img_name IS NULL;
This new query is not fast in any sense of the word, but it does finish without timing out. There is still some debate about whether the 906 rows it returned are correct or not [2].
[0]: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit [1]: https://phabricator.wikimedia.org/T149624 [2]: https://quarry.wmflabs.org/query/14501
Bryan -- Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA irc: bd808 v:415.839.6885 x6855
_______________________________________________ Labs-l mailing list Labs-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/labs-l
wikimetrics@lists.wikimedia.org