[Labs-l] Tell us about the SQL that you can't get to work
Nuria Ruiz
nuria at wikimedia.org
Tue Dec 6 19:05:08 UTC 2016
>An idea I just had for showing interest is to get Tool Labs
>maintainers and other Labs users to describe questions that they have
>tried and failed to answer using SQL queries.
I think there are two concerns here, correctness and performance.
On the analytics team we tried (early on) to use the labs db to calculate
metrics about data for all wikis since the beginning of time. It was clear
that regarless of correctness of the sql the labs environment could not
sustain that rate of querying daily. We tried for several months different
things but eventually changed our approach. We have since then started
working on the 'data lake' for edit data so you can query with ease (both
in terms of performance and sql) all the editing data for mediawiki since
the beginning of time.
We are building the dataset in hadoop but our plan is (once-redacted, this
is a big task) make that dataset available as a backend for tools such as
quarry. It will greatly simplify the sql users have to write but (perhaps
more importantly) it will make possible to query the data in ways that we
cannot at this time. I think this is a worthy topic to talk about at the
summit.
How denormalized tables look like:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_history
On Tue, Dec 6, 2016 at 9:28 AM, Bryan Davis <bd808 at wikimedia.org> wrote:
> In early January there is going to be a Developer Summit in San
> Francisco [0]. Chase and I are in charge of scheduling talks on the
> topic "Building on Wikimedia services: APIs and Developer Resources".
> One of the more interesting to me talks that has been proposed for
> this is "Labsdbs for WMF tools and contributors: get more data,
> faster" by Jamie Crespo [1].
>
> I know that most of you won't be able to attend in person, but if we
> can show that there is enough interest in this topic we can get the
> talk scheduled in a main room and recorded so anyone can watch it
> later.
>
> An idea I just had for showing interest is to get Tool Labs
> maintainers and other Labs users to describe questions that they have
> tried and failed to answer using SQL queries. We can look at the kinds
> of questions that come up and ask Jamie (and others) if there are some
> general recommendations that can be made about how to improve
> performance or understand how the bits and pieces of our data model
> fit together.
>
>
> To kick things off, here's an example I tried to help with over the
> weekend. A Quarry user was adapting a query they had used before to
> find non-redirect File namespace pages not paired with binary files on
> Commons. The query they had come up with was:
>
> SELECT DISTINCT page_title, img_name
> FROM (
> SELECT DISTINCT page_title
> FROM page WHERE page_namespace = 6
> AND page_is_redirect = 0
> ) AS page
> LEFT JOIN (
> SELECT DISTINCT img_name
> FROM image
> ) AS image ON page_title=img_name
> WHERE img_name IS NULL;
>
> The performance of this is horrible for several reasons including the
> excessive use of DISTINCT. The query was consistently killed by the 30
> minute runtime limit. MaxSem and I both came up with about the same
> optimization that eliminated the sub-queries and use of DISTINCT:
>
> SELECT page_title, img_name
> FROM page LEFT OUTER JOIN image ON page_title=img_name
> WHERE page_namespace = 6
> AND page_is_redirect = 0
> AND img_name IS NULL;
>
> This new query is not fast in any sense of the word, but it does
> finish without timing out. There is still some debate about whether
> the 906 rows it returned are correct or not [2].
>
> [0]: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit
> [1]: https://phabricator.wikimedia.org/T149624
> [2]: https://quarry.wmflabs.org/query/14501
>
> Bryan
> --
> Bryan Davis Wikimedia Foundation <bd808 at wikimedia.org>
> [[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
> irc: bd808 v:415.839.6885 x6855
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.wikimedia.org/pipermail/labs-l/attachments/20161206/e0f1942c/attachment.html>
More information about the Labs-l
mailing list