Hi Justin,
Thanks for being mindful of the amount of data you are getting into :)
A few things to consider:
- The Hive table is partitioned by year / month / day / hour. For testing
purposes, setting all four values makes the volume of data to scan much
smaller (~350G per month, so ~11G per day and ~0.5G per hour).
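As a concrete sketch of that partition pruning (assuming the same columns as
the reference query below; exact values are just an example), restricting all
four partition keys scopes the scan to a single ~0.5G hourly partition:

```sql
-- Fully pruned: Hive only reads the one hourly partition,
-- instead of scanning the whole table.
SELECT view_count
FROM pageview_hourly
WHERE year = 2015
  AND month = 1
  AND day = 1
  AND hour = 0
  AND page_title = 'World_War_II'
  AND country_code = 'CA'
  AND agent_type = 'user';
```

Dropping any of year/month/day/hour from the WHERE clause widens the set of
partitions Hive has to read accordingly.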
On Sat, Apr 23, 2016 at 12:57 AM, Justin Clark <jclark(a)cyber.law.harvard.edu>
wrote:
Hi all,
I'm a researcher at the Berkman Center for Internet & Society at Harvard
doing some work on anomaly detection against Wikipedia article request
volumes.
I'd like to create time series of request volumes for as many
article-country pairs as possible. I'm using a number of different data
sets, but the most useful for our purposes is the pageview_hourly table. I
understand that this is a tremendous amount of data, and we are in the
process of prioritizing the article-country pairs, but my question is: what
is the best/fastest way to query this data from hive? Writing a query that
gets at the data is not a problem, but I'm curious about possible
strategies that could speed up the process.
Here is a reference query that shows the kind of data I'm looking for:
SELECT day, hour, view_count FROM pageview_hourly WHERE
year = 2015 AND
month = 1 AND
page_title = 'World_War_II' AND
country_code = 'CA' AND
agent_type = 'user'
ORDER BY day, hour;
A couple options that come to mind:
* a year > 0 query vs many yearly, monthly, daily, or hourly queries
* batching articles with page_title IN (...)
* dropping country_code to get all countries at once (or batch like
above)
* ordering post hoc (client-side) to avoid the map-reduce overhead
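The batching options above could be combined into a single query along these
lines (a sketch only, assuming the columns shown in the reference query; the
article list and the SUM over any remaining per-row splits are illustrative):

```sql
-- One scan of a month's partitions, many articles and all countries at once.
-- SUM collapses any per-row breakdowns (e.g. access method) into hourly totals.
SELECT page_title,
       country_code,
       day,
       hour,
       SUM(view_count) AS views
FROM pageview_hourly
WHERE year = 2015
  AND month = 1
  AND page_title IN ('World_War_II', 'World_War_I')
  AND agent_type = 'user'
GROUP BY page_title, country_code, day, hour;
```

The results can then be split per article-country pair and sorted by (day,
hour) client-side, which keeps the expensive cluster-side ORDER BY out of the
query.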
Because there's so much data and a query like the above takes ~10 minutes,
experimenting with these is a long process. I was hoping someone more
familiar could share any magic that might speed things up (or tell me
there's no magic bullet and everything will take about as long as
everything else). If no one can say quickly off the top of their head, I
can just do that experimentation, but more options to try are totally
welcome.
Thanks,
Justin
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal