It would be awesome if you could add a similar (wiki, timestamp) index for exactly the same reasons to all the MultimediaViewerNetworkPerformance* tables on the same database. Those tables haven't been problematic yet because they're a lot smaller, but we might as well make all Multimedia EventLogging queries faster in one swoop.


On Mon, May 12, 2014 at 3:36 PM, Gilles Dubuc <gilles@wikimedia.org> wrote:
Thanks for adding that index, that's exactly what I needed. I'll update the queries to what you suggested now. I actually looked into doing exactly that change earlier today, but couldn't get it to hit the timestamp index. I wasn't specifying a value for "wiki" (it's one of the queries we run, the "global" counts), and it wasn't hitting the index, probably because of NULL values. Now I can force it hit that new index, so everything should be back to sane execution time for those tables.


On Mon, May 12, 2014 at 2:33 PM, Sean Pringle <springle@wikimedia.org> wrote:

On Mon, May 12, 2014 at 8:48 PM, Gilles Dubuc <gilles@wikimedia.org> wrote:
Which ones are your tsvs stuff?

The ones I'm trying to improve are querying MediaViewer_8245578 (just grep for that on the page you've linked to, there will be some things I ran manually today, but the big queries are the problematic ones). The existing queries are full table scans and seem to take around 15 minutes. We run 27 of these every day one after the other. The index I want to add will actually help me break those queries down into several smaller queries. There's no point trying to break them down right now, because I expect that without indexes the simpler queries will take just as long because they'll be full table scans as well.

Ok, think I see the right ones.

If you want to avoid breaking the queries up, push the conditions down onto each base table of the union:

SELECT
...
FROM (
    SELECT timestamp, wiki, event_action FROM MediaViewer_7670440
    WHERE wiki = 'cawiki'
        AND timestamp < TIMESTAMP(CURDATE())
        AND timestamp >= TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 30 DAY))
    UNION ALL
    SELECT timestamp, wiki, event_action FROM MediaViewer_8245578
    WHERE wiki = 'cawiki'
        AND timestamp < TIMESTAMP(CURDATE())
        AND timestamp >= TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 30 DAY))
) AS MediaViewerUnioned
GROUP BY datestring ORDER BY datestring ASC;

That will allow a range access on the base tables' timestamp indexes which should reduce the row count a bit. Then if we add an index on (wiki, timestamp), which I'll kick off now, it drops the execution time to ~1min.

UNION ALL avoids the sorting overhead of UNION DISTINCT but unfortunately still materializes the entire set into a temporary table. Pushing the conditions down makes the query more verbose, but perhaps that's still easier than maintaining multiple queries and emulating union.

Sean

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics