Again, without misclick sending (sorry for the spam).

Hi Justin,

First, one important thing: the data you are trying to get is VERY sensitive data in term of potential personal information (pageview + country at very granular time level).
I'd like to understand better how you'll keep it (how long, what protection, who will have access), and also how you plan to publish about it.


Now that administrative things are said, some technical stuff:

Here is how I'd do it (with some time / data size considerations):

INSERT OVERWRITE DIRECTORY "/tmp/joal/test_pageviews_countries"
    SELECT
        year,
        month,
        day,
        hour,
        page_title,
        country_code,
        SUM(view_count) as views
    FROM
        wmf.pageview_hourly
    WHERE
        year = 2016
        AND month = 1
        AND agent_type = 'user'
    GROUP BY
        year,
        month,
        day,
        hour,
        page_title,
        country_code
    ;

I have run this query with more time restrictions for testing purposes (as discussed above).

Another thing to consider: one month of the data generated with that query would be about 200G. It will take some time to copy over.

Joseph




On Sat, Apr 23, 2016 at 12:57 AM, Justin Clark <jclark@cyber.law.harvard.edu> wrote:
Hi all,

I'm a researcher at the Berkman Center for Internet & Society at Harvard doing some work on anomaly detection against Wikipedia article request volumes.

I'd like to create time series of request volumes for as many article-country pairs as is possible. I'm using a number of different data sets, but the most useful for our purposes is the pageview_hourly table. I understand that this is a tremendous amount of data, and we are in the process of prioritizing the article-country pairs, but my question is: what is the best/fastest way to query this data from hive? Writing a query that gets at the data is not a problem, but I'm curious about possible strategies that could speed up the process.

Here is a reference query that shows the kind of data I'm looking for:

SELECT view_count FROM pageview_hourly WHERE
year = 2015 AND
month = 1 AND
page_title = 'World_War_II' AND
country_code = 'CA' AND
agent_type = 'user'
ORDER BY day, hour;

A couple options that come to mind:
  * a year > 0 query vs many yearly, monthly, daily, or hourly queries
  * batching articles with page_title IN (...)
  * dropping country_code to get all countries at once (or batch like above)
  * ordering posthoc to avoid the map-reduce overhead

Because there's so much data and a query like the above takes ~10 minutes, experimenting with these is a long process. I was hoping someone more familiar could share any magic that might speed things up (or tell me there's no magic bullet and everything will take about as long as everything else). If no one can say quickly off the top of their head, I can just do that experimentation, but more options to try are totally welcome.

Thanks,
  Justin

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics



--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal



--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal