Justin,
If you reduce the amount of data you are sampling queries will be faster,
instead of sampling one month try an hour.
Thanks,
Nuria
On Fri, Apr 22, 2016 at 3:57 PM, Justin Clark <jclark(a)cyber.law.harvard.edu>
wrote:
Hi all,
I'm a researcher at the Berkman Center for Internet & Society at Harvard
doing some work on anomaly detection against Wikipedia article request
volumes.
I'd like to create time series of request volumes for as many
article-country pairs as is possible. I'm using a number of different data
sets, but the most useful for our purposes is the pageview_hourly table. I
understand that this is a tremendous amount of data, and we are in the
process of prioritizing the article-country pairs, but my question is: what
is the best/fastest way to query this data from hive? Writing a query that
gets at the data is not a problem, but I'm curious about possible
strategies that could speed up the process.
Here is a reference query that shows the kind of data I'm looking for:
SELECT view_count FROM pageview_hourly WHERE
year = 2015 AND
month = 1 AND
page_title = 'World_War_II' AND
country_code = 'CA' AND
agent_type = 'user'
ORDER BY day, hour;
A couple options that come to mind:
* a year > 0 query vs many yearly, monthly, daily, or hourly queries
* batching articles with page_title IN (...)
* dropping country_code to get all countries at once (or batch like
above)
* ordering posthoc to avoid the map-reduce overhead
Because there's so much data and a query like the above takes ~10 minutes,
experimenting with these is a long process. I was hoping someone more
familiar could share any magic that might speed things up (or tell me
there's no magic bullet and everything will take about as long as
everything else). If no one can say quickly off the top of their head, I
can just do that experimentation, but more options to try are totally
welcome.
Thanks,
Justin
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics