Sending this again, hopefully without a misclick this time (sorry for the spam).
Hi Justin,
First, one important thing: the data you are trying to get is VERY sensitive in terms of potential personal information (pageviews + country at a very granular time level).
I'd like to better understand how you'll keep it (for how long, with what protections, and who will have access), and also how you plan to publish about it.
Now that the administrative part is out of the way, some technical notes:
- The hive table is partitioned by year / month / day / hour. For testing purposes, setting those four values makes the volume of data to scan much smaller (~350G per month, so ~11G per day and ~0.5G per hour).
- Whatever query you run, every column it references (for filtering, group by, or selection) will be scanned. In that respect, a single query getting all the articles or countries at once is much cheaper than repeated queries filtering per article or country.
- Total ordering costs a lot, because it forces all the data through a single reducer. If you can do without it, it'll be cheaper :) (there is a sketch of a SORT BY alternative after the timing numbers below)
- Finally, your query has a bug: since the table has more dimensions than the ones you are filtering by / selecting, you'll get many more rows than expected. You need to group by the selected columns and sum view_count.
Here is how I'd do it (with some time / data size considerations):
INSERT OVERWRITE DIRECTORY "/tmp/joal/test_pageviews_countries"
SELECT
year,
month,
day,
hour,
page_title,
country_code,
SUM(view_count) as views
FROM
wmf.pageview_hourly
WHERE
year = 2016
AND month = 1
AND agent_type = 'user'
GROUP BY
year,
month,
day,
hour,
page_title,
country_code
;
I have run this query with more time restrictions for testing purposes (as discussed above).
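For reference, the one-hour restriction is just the day and hour partitions added to the WHERE clause (everything else in the query unchanged); a minimal sketch of the predicates, using the values from the first run below:

WHERE
year = 2016
AND month = 1
AND day = 1      -- partition restriction for testing
AND hour = 0     -- partition restriction for testing
AND agent_type = 'user'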
- For one hour of data (day = 1, hour = 0):
- output data size ~ 220M
- actual CPU time 3mins 2secs
- waiting time 66 secs
- For one hour of data (day = 1, hour = 0) with sorting (ORDER BY views DESC LIMIT 1000000000):
- output data size ~ 220M
- actual CPU time 4mins 30secs (a ~50% overhead over the original time, and since the total ordering is not parallelizable, it would be much more for bigger data)
- waiting time 114 secs
- For one day of data (day = 1):
- output data size ~ 6.5G
- actual CPU time 1h 28 mins
- waiting time 132 secs
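If you do end up needing sorted output, one possible middle ground (not what I ran above, just a sketch) is Hive's SORT BY instead of ORDER BY: each reducer sorts its own output in parallel, so you avoid the single-reducer bottleneck, but you get several sorted files rather than one globally ordered result. The tail of the query would become:

GROUP BY
year,
month,
day,
hour,
page_title,
country_code
SORT BY
views DESC
;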
Another thing to consider: one month of the data generated with that query would be about 200G. It will take some time to copy over.
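If it helps, you can check the size of the output directory from within the Hive CLI before copying (same path as in the INSERT OVERWRITE above); the copy itself would then be a regular hdfs dfs -get / -getmerge:

dfs -du -s -h /tmp/joal/test_pageviews_countries;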
Joseph