Sean,

First, thanks for killing those queries.

Second, I've cc-ed mobile tech to talk about this problem.  This query is running periodically on a scheduler.  It is re-computing all data every time it runs.  The mobile folks developed logic in the scheduler to time-box some queries, but problems with Event Logging data and the incredibly frequent need to re-run after data outages and backfills has left this kind of monster query as the only sane option.  So our current reality is that the only sane option is insane.

Indexing these tables to make this insane query run is like the old "Why don't you turn in your crazy brother who thinks he's a chicken?  I would, but I need the eggs." joke.  Yes we do need the eggs, but no we're not going to get them from this query.

Marcel wrote a new scheduler that handles failures much better and is much easier to work with in terms of re-runs.  That's the only option to keep this kind of data generation alive.  Until we get that running, we are going to cancel this and other non-timeboxed queries.  This patch: https://gerrit.wikimedia.org/r/#/c/210364/ has mobile folks on it and Sean.  Sean, if this problem happens again before we take action, just merge this patch and that will stop the bleeding.

In the long run, we are trying to migrate this data to a scalable analytics platform where you can run queries like this and the system will cleverly re-compute only if the underlying data changes.  Druid, Pipeline DB, etc. are candidates.  The intermediate step to that will be moving the Event Logging pipeline to Kafka and Hadoop, which we are in the process of doing right now.
 

On Tue, May 12, 2015 at 2:12 AM, Sean Pringle <springle@wikimedia.org> wrote:
I just killed 100+ 3-day unindexed research queries on dbstore1002.
All replication streams were lagging by nearly 1 day, and /tmp was
hundreds of GB.

This seems similar to the problem from ~2 weeks ago, when old /tmp did
fill up. The queries were of the following form (but with some
variation). We need some indexing scheme for  MobileWebEditing*
tables, or to come up with a new approach.

SELECT
    Month.Date,
    COALESCE(Web.Web, 0) AS Web

-- http://stackoverflow.com/a/6871220/365238
-- ... using MariaDB 10 SEQUENCE engine instead of information_schema.columns
FROM (
    SELECT DATE_FORMAT(
        ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1),
        '%Y-%m-%d'
    ) AS Date
    FROM seq_1_to_100, (SELECT @num:=-1) num LIMIT 30
) AS Month
LEFT JOIN (
    SELECT
        DATE(timestamp) AS Date,
        SUM(1) AS Web

    FROM (SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount FROM MobileWebEditing_5644223
UNION SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount FROM MobileWebEditing_6077315
UNION SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount from MobileWebEditing_6637866
UNION SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount from MobileWebEditing_7675117
UNION SELECT timestamp, wiki, event_username, event_action,
event_namespace, event_userEditCount from MobileWebEditing_8599025) as
MobileWebEditing WHERE
        event_action = 'error' AND
        wiki != 'testwiki'
    GROUP BY Date
) AS Web ON Month.Date = Web.Date;

EXPLAIN:

+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
| id   | select_type  | table                    | type   |
possible_keys | key     | key_len | ref        | rows     | Extra
                                  |
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
|    1 | PRIMARY      | <derived2>               | ALL    | NULL
   | NULL    | NULL    | NULL       |       30 |
                       |
|    1 | PRIMARY      | <derived4>               | ref    | key0
   | key0    | 4       | Month.Date |   563154 | Using where
                       |
|    4 | DERIVED      | <derived5>               | ALL    | NULL
   | NULL    | NULL    | NULL       | 56315405 | Using where; Using
temporary; Using filesort |
|    5 | DERIVED      | MobileWebEditing_5644223 | ALL    | NULL
   | NULL    | NULL    | NULL       |  1152600 |
                       |
|    6 | UNION        | MobileWebEditing_6077315 | ALL    | NULL
   | NULL    | NULL    | NULL       |   685212 |
                       |
|    7 | UNION        | MobileWebEditing_6637866 | ALL    | NULL
   | NULL    | NULL    | NULL       |  1528269 |
                       |
|    8 | UNION        | MobileWebEditing_7675117 | ALL    | NULL
   | NULL    | NULL    | NULL       |  1663281 |
                       |
|    9 | UNION        | MobileWebEditing_8599025 | ALL    | NULL
   | NULL    | NULL    | NULL       | 51286043 |
                       |
| NULL | UNION RESULT | <union5,6,7,8,9>         | ALL    | NULL
   | NULL    | NULL    | NULL       |     NULL |
                       |
|    2 | DERIVED      | <derived3>               | system | NULL
   | NULL    | NULL    | NULL       |        1 |
                       |
|    2 | DERIVED      | seq_1_to_100             | index  | NULL
   | PRIMARY | 8       | NULL       |      100 | Using index
                       |
|    3 | DERIVED      | NULL                     | NULL   | NULL
   | NULL    | NULL    | NULL       |     NULL | No tables used
                       |
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+


---
DBA @ WMF

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics