<div dir="ltr">><span style="font-size:12.8px">An idea I just had for showing interest is to get Tool Labs</span><br style="font-size:12.8px"><span style="font-size:12.8px">>maintainers and other Labs users to describe questions that they have</span><br style="font-size:12.8px"><span style="font-size:12.8px">>tried and failed to answer using SQL queries.</span><div><span style="font-size:12.8px"><br></span><div>I think there are two concerns here, correctness and performance. </div><div>On the analytics team we tried (early on) to use the labs db to calculate metrics about data for all wikis since the beginning of time. It was clear that regarless of correctness of the sql the labs environment could not sustain that rate of querying daily.  We tried for several months different things but eventually changed our approach. We have since then started working on the 'data lake' for edit data so you can query with ease (both in terms of performance and sql) all the editing data for mediawiki since the beginning of time.</div><div><br></div><div>We are building the dataset in hadoop but our plan is (once-redacted, this is a big task) make that dataset available as a backend for tools such as quarry. It will greatly simplify the sql users have to write but (perhaps more importantly) it will make possible to query the data in ways that we cannot at this time. I think this is a worthy topic to talk about at the summit.</div><div><br></div><div>How denormalized tables look like:</div><div><a href="https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_history">https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_history</a><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 6, 2016 at 9:28 AM, Bryan Davis <span dir="ltr"><<a href="mailto:bd808@wikimedia.org" target="_blank">bd808@wikimedia.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">In early January there is going to be a Developer Summit in San<br>

Francisco [0]. Chase and I are in charge of scheduling talks on the<br>

topic "Building on Wikimedia services: APIs and Developer Resources".<br>

One of the more interesting to me talks that has been proposed for<br>

this is "Labsdbs for WMF tools and contributors: get more data,<br>

faster" by Jamie Crespo [1].<br>

<br>

I know that most of you won't be able to attend in person, but if we<br>

can show that there is enough interest in this topic we can get the<br>

talk scheduled in a main room and recorded so anyone can watch it<br>

later.<br>

<br>

An idea I just had for showing interest is to get Tool Labs<br>

maintainers and other Labs users to describe questions that they have<br>

tried and failed to answer using SQL queries. We can look at the kinds<br>

of questions that come up and ask Jamie (and others) if there are some<br>

general recommendations that can be made about how to improve<br>

performance or understand how the bits and pieces of our data model<br>

fit together.<br>

<br>

<br>

To kick things off, here's an example I tried to help with over the<br>

weekend. A Quarry user was adapting a query they had used before to<br>

find non-redirect File namespace pages not paired with binary files on<br>

Commons. The query they had come up with was:<br>

<br>

  SELECT DISTINCT page_title, img_name<br>

  FROM (<br>

      SELECT DISTINCT page_title<br>

      FROM page WHERE page_namespace = 6<br>

       AND page_is_redirect = 0<br>

    ) AS page<br>

    LEFT JOIN (<br>

      SELECT DISTINCT img_name<br>

      FROM image<br>

    ) AS image ON page_title=img_name<br>

  WHERE img_name IS NULL;<br>

<br>

The performance of this is horrible for several reasons including the<br>

excessive use of DISTINCT. The query was consistently killed by the 30<br>

minute runtime limit. MaxSem and I both came up with about the same<br>

optimization that eliminated the sub-queries and use of DISTINCT:<br>

<br>

  SELECT page_title, img_name<br>

  FROM page LEFT OUTER JOIN image ON page_title=img_name<br>

  WHERE page_namespace = 6<br>

    AND page_is_redirect = 0<br>

    AND img_name IS NULL;<br>

<br>

This new query is not fast in any sense of the word, but it does<br>

finish without timing out. There is still some debate about whether<br>

the 906 rows it returned are correct or not [2].<br>

<br>

[0]: <a href="https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit" rel="noreferrer" target="_blank">https://www.mediawiki.org/<wbr>wiki/Wikimedia_Developer_<wbr>Summit</a><br>

[1]: <a href="https://phabricator.wikimedia.org/T149624" rel="noreferrer" target="_blank">https://phabricator.wikimedia.<wbr>org/T149624</a><br>

[2]: <a href="https://quarry.wmflabs.org/query/14501" rel="noreferrer" target="_blank">https://quarry.wmflabs.org/<wbr>query/14501</a><br>

<span class="HOEnZb"><font color="#888888"><br>

Bryan<br>

--<br>

Bryan Davis              Wikimedia Foundation    <<a href="mailto:bd808@wikimedia.org">bd808@wikimedia.org</a>><br>

[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA<br>

irc: bd808                                        v:<a href="tel:415.839.6885%20x6855" value="+14158396885">415.839.6885 x6855</a><br>

<br>

______________________________<wbr>_________________<br>

Labs-l mailing list<br>

<a href="mailto:Labs-l@lists.wikimedia.org">Labs-l@lists.wikimedia.org</a><br>

<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" rel="noreferrer" target="_blank">https://lists.wikimedia.org/<wbr>mailman/listinfo/labs-l</a><br>

</font></span></blockquote></div><br></div>