Below is an example Hive query yielding the 50 most viewed pages in
India during December 2015. It took less than 10 minutes of wall clock
time to complete.
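SELECT
  -- The next line was garbled in the archive; the '/' separator is an assumed reconstruction.
  CONCAT(project, '/', page_title),
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE
  year = 2015
  AND month = 12
  AND country = "India"
  AND agent_type = "user"
GROUP BY project, page_title
ORDER BY views DESC LIMIT 50;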
...
Total MapReduce CPU Time Spent: 0 days 19 hours 13 minutes 2 seconds 930 msec
OK
_c0 views
182336
Time taken: 562.621 seconds, Fetched: 50 row(s)
See also the discussion at
(As mentioned there, a while ago I retrieved the global top 200 pages
for a timespan of almost six months, with some wait time but no major
issues. It's not quite clear to me why the "brute force" approach
mentioned in the ticket failed, but I guess it had to do with the
difficulty of repeating such a query for all projects - or countries -
to generate top lists for every one of them.)
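For the narrower case Kevin describes below (given a wiki and a country, the top pages for one day), the query above can probably be cut down as follows. This is only a sketch and has not been run: it assumes that wmf.pageview_hourly has a `day` partition column alongside `year` and `month`, and that `project` holds values of the form 'en.wikipedia'. The date, wiki and country are placeholders.

```sql
-- Sketch: top 50 pages for one wiki in one country on a single day.
-- Assumptions (not confirmed in this thread): a `day` partition column
-- exists, and `project` holds values such as 'en.wikipedia'.
SELECT
  page_title,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE
  year = 2015
  AND month = 12
  AND day = 1                      -- assumed partition column; placeholder date
  AND project = 'en.wikipedia'     -- assumed value format; placeholder wiki
  AND country = 'India'
  AND agent_type = 'user'
GROUP BY page_title
ORDER BY views DESC
LIMIT 50;
```

Restricting the scan to a single day and a single project should make it far cheaper than the month-long query above, since only that day's partitions need to be read.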
On Wed, Jan 20, 2016 at 12:42 PM, Kevin Leduc <kevin(a)wikimedia.org> wrote:
+Analytics list so they can comment.
I don't have such a script. It's a pretty intensive job to compile top
articles, especially over a month. The pageview API was supposed to have
top articles per month per wiki, but the job is so massive that it failed
to run in Hive. Analytics knows there are better algorithms out there to
solve this problem. So the pageview API just has top per day per wiki.
I imagine that you are looking at some very specific wikis and
countries, not all of them. Maybe someone on the list can make an example
Hive script (given a wiki and country) that gives the top for a day.
On Wed, Jan 20, 2016 at 12:23 PM, Dan Foy <dfoy(a)wikimedia.org> wrote:
Hi Kevin,
In your collection of scripts for Hive, do you have one that can act as a
starting point for me to get the top N articles / URLs for Wikipedia in a
country?
Thanks,
Dan
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB