Hi,
Thanks for reaching out to us, you are definitely asking the questions in the right column. As Dario mentioned, we are working on pageview api which will eventually have support to query pageview counts for all pages belonging to category Foo. But that won't be operational before October 11 but I do want to see if we can help you nonetheless.
My advice would be the following:
3) Run the following queries (I took Medicine as an example category)
Query 1:
#Get all subcategories for category Medicine
SELECT
page.page_title
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE
cl_to IN ('Medicine')
AND
cl_type = 'subcat';
It gives a list of all the first-order child categories of the root category 'Medicine'. Now obviously you could traverse further down and get sub-sub categories etc but this is merely to illustrate a minimum approach. (Constructing a category graph is not entirely trivial as you have to consider potential loops).
Query 2:
#Get all pages from category Medicine
SELECT
page.page_title,
page.page_namespace
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE
cl_to IN ('Medicine')
AND
cl_type = 'page';
This query returns all the article titles and their namespace for pages that belong to the 'Medicine' category. This mirrors the second table from https:/en/
wikipedia.org/wiki/Category:Medicine
Query 3:
#Get all pages from category Medicine and it's subcategories
SELECT
page.page_title,
page.page_namespace
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE cl_to IN (SELECT page.page_title FROM page INNER JOIN categorylinks ON page.page_id = categorylinks.cl_from WHERE cl_to = 'Medicine' AND cl_type = 'subcat')
AND
cl_type = 'page'
AND page.page_namespace = 0;
The final query basically combines query 1 and query 2 and get's a list of article titles that belong either to the category 'Medicine' or one of it's subcategories.
I have attached a csv file with the results of that query. It contains 2809 article titles. I am sure the queries are not dealing with all the edge cases, can be refined etc. but my goal was to illustrate how to tackle your problem using existing tools that are available for all.
4) Finally, you would have to run a simple script and retrieve the pageview numbers for each of 2809 article titles from
stats.grok.se (this you will have to do yourself but a combination of bash, wget and qs should do the trick or write a python / php / ruby script that does this for you).
That's all ;)
We do want to make this feature part of a more general purpose pageview api for which we are discussing the contours on this list. Please chime in with your use-cases!
I hope this will help you to get the data before October 11th.
Best,
Diederik