Hi all,
I'm a researcher at the Berkman Center for Internet & Society at Harvard doing some work on anomaly detection against Wikipedia article request volumes.
I'd like to create time series of request volumes for as many article-country pairs as possible. I'm using a number of different data sets, but the most useful for our purposes is the pageview_hourly table. I understand that this is a tremendous amount of data, and we are in the process of prioritizing the article-country pairs, but my question is: what is the best/fastest way to query this data from Hive? Writing a query that gets at the data is not a problem, but I'm curious about possible strategies that could speed up the process.
Here is a reference query that shows the kind of data I'm looking for:
SELECT view_count FROM pageview_hourly WHERE year = 2015 AND month = 1 AND page_title = 'World_War_II' AND country_code = 'CA' AND agent_type = 'user' ORDER BY day, hour;
A couple of options that come to mind:
- a single year > 0 query vs. many yearly, monthly, daily, or hourly queries
- batching articles with page_title IN (...) (see the sketch below)
- dropping country_code to get all countries at once (or batching as above)
- ordering post hoc to avoid the map-reduce overhead
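For concreteness, here is an untested sketch of what I mean by the batching option; it just mirrors the reference query above, and the extra titles are placeholders:

SELECT page_title, day, hour, view_count
FROM pageview_hourly
WHERE year = 2015 AND month = 1
  -- placeholder batch of titles; the prioritized list would go here
  AND page_title IN ('World_War_II', 'World_War_I', 'Cold_War')
  AND country_code = 'CA'
  AND agent_type = 'user'
ORDER BY page_title, day, hour;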
Because there's so much data and a query like the above takes ~10 minutes, experimenting with these is a long process. I was hoping someone more familiar could share any magic that might speed things up (or tell me there's no magic bullet and everything will take about as long as everything else). If no one can say quickly off the top of their head, I can just do that experimentation, but more options to try are totally welcome.
Thanks, Justin
Justin,
If you reduce the amount of data you are sampling, queries will be faster; instead of sampling one month, try an hour.
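For example, a sketch of the reference query pinned to a single hour (summing over the remaining dimensions so you get a single number back):

SELECT SUM(view_count) AS views
FROM pageview_hourly
-- pinning all four partition columns keeps the scan to ~one hour of data
WHERE year = 2015 AND month = 1 AND day = 1 AND hour = 0
  AND page_title = 'World_War_II'
  AND country_code = 'CA'
  AND agent_type = 'user';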
Thanks,
Nuria
Hi Justin,
Thanks for your understanding about the amount of data you are getting into :) A few things to consider:
- The hive table is partitioned by year / month / day / hour. For testing purposes, having those four values set makes the volume of data to scan much smaller (~350G per month, therefore ~11G per day, ~0.5G per hour).
Sending again, this time without the misclick (sorry for the spam).
Hi Justin,
First, one important thing: the data you are trying to get is VERY sensitive in terms of potential personal information (pageviews + country at a very granular time level). I'd like to understand better how you'll keep it (how long, what protection, who will have access), and also how you plan to publish about it.
Now that administrative things are said, some technical stuff:
- The hive table is partitioned by year / month / day / hour. For testing purposes, having those four values set makes the volume of data to scan much smaller (~350G per month, therefore ~11G per day, ~0.5G per hour).
- Whatever request you make, the columns needed for your request (for filtering, group by, or selection) will be scanned. In that respect, a single query getting all the articles or countries at once is much cheaper than repeated queries filtering per article or country.
- Total ordering costs a lot, in that it implies all the data flowing through a single reducer. If you can do without it, it'll be cheaper :)
- Finally, your query has a bug: since there are other dimensions in the table than the ones you are filtering by / selecting, you'll have many more rows than expected. You need to group by the selected columns and sum the view_count.
Here is how I'd do it (with some time / data size considerations):
INSERT OVERWRITE DIRECTORY "/tmp/joal/test_pageviews_countries"
SELECT
  year, month, day, hour, page_title, country_code,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2016
  AND month = 1
  AND agent_type = 'user'
GROUP BY year, month, day, hour, page_title, country_code;
I have run this query with more time restrictions for testing purposes (as discussed above).
- For one hour of data (day = 1, hour = 0):
  - output data size: ~220M
  - actual CPU time: 3 mins 2 secs
  - waiting time: 66 secs
- For one hour of data (day = 1, hour = 0) with sorting (ORDER BY views DESC LIMIT 1000000000; see the sketch after this list):
  - output data size: ~220M
  - actual CPU time: 4 mins 30 secs (a 50% overhead over the original time, and since sorting is not parallelizable, it would be much more for bigger data)
  - waiting time: 114 secs
- For one day of data (day = 1):
  - output data size: ~6.5G
  - actual CPU time: 1h 28 mins
  - waiting time: 132 secs
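For reference, the sorted variant is just the same query with the global ORDER BY appended, roughly:

INSERT OVERWRITE DIRECTORY "/tmp/joal/test_pageviews_countries"
SELECT
  year, month, day, hour, page_title, country_code,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2016 AND month = 1 AND day = 1 AND hour = 0
  AND agent_type = 'user'
GROUP BY year, month, day, hour, page_title, country_code
-- a global ORDER BY forces all rows through a single reducer;
-- the huge LIMIT satisfies Hive's strict-mode rule that ORDER BY comes with a LIMIT
ORDER BY views DESC
LIMIT 1000000000;

If a total order isn't strictly needed, Hive's DISTRIBUTE BY / SORT BY gives per-reducer ordering without that single-reducer bottleneck.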
Another thing to consider: one month of the data generated with that query would be about 200G. It will take some time to copy over.
Joseph
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
Justin, I realized this morning that I had forgotten an important point in my example yesterday: the query shown doesn't identify the project the pageviews belong to. The Foundation currently hosts more than 800 "projects", among them every Wikipedia language, every Wiktionary language, Wikidata, Wikivoyage, etc. This piece of data is obviously important when looking at pageviews:
INSERT OVERWRITE DIRECTORY "/tmp/joal/test_pageviews_countries"
SELECT
  year, month, day, hour, project, page_title, country_code,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2016
  AND month = 1
  AND agent_type = 'user'
GROUP BY year, month, day, hour, project, page_title, country_code;
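And once you have narrowed down to a prioritized set of article-country pairs, the same query shape works with an extra filter; a sketch, where the project, titles, and country codes are just placeholders (the output is small enough that the INSERT OVERWRITE DIRECTORY can be dropped):

SELECT
  year, month, day, hour, project, page_title, country_code,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2016
  AND month = 1
  AND agent_type = 'user'
  -- placeholder prioritized lists
  AND project = 'en.wikipedia'
  AND page_title IN ('World_War_II', 'Cold_War')
  AND country_code IN ('CA', 'US')
GROUP BY year, month, day, hour, project, page_title, country_code;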
Best,
Joseph
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal