Hi,

Thanks for reaching out to us; you are definitely asking your questions in the right place. As Dario mentioned, we are working on a pageview API that will eventually support querying pageview counts for all pages belonging to category Foo. That won't be operational before October 11, but I do want to see if we can help you nonetheless.

My advice would be the following:

1) Create an account on https://wikitech.wikimedia.org/wiki/Main_Page
2) Request access (https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request) to be part of the tools-lab project -- this project offers access to redacted mirrors of the actual production MySQL databases.
3) Run the following queries (I took Medicine as an example category)

Query 1:
#Get all subcategories for category Medicine
SELECT 
page.page_title 
FROM 
page 
INNER JOIN 
categorylinks 
ON 
page.page_id = categorylinks.cl_from 
WHERE 
cl_to IN ('Medicine') 
AND 
cl_type = 'subcat';


This will give you the same result as can be found on https://en.wikipedia.org/wiki/Category:Medicine
It gives a list of all the first-order child categories of the root category 'Medicine'. You could traverse further down and get sub-subcategories and so on, but this is merely to illustrate a minimal approach. (Constructing a category graph is not entirely trivial, as you have to consider potential loops.)
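
To illustrate the point about loops, here is a minimal Python sketch of a breadth-first walk that remembers which categories it has already visited. The function name collect_subcategories and the get_subcats callable are hypothetical; in practice get_subcats would wrap Query 1 for a given category title, while here it is backed by a small made-up graph so the sketch runs on its own.

```python
from collections import deque


def collect_subcategories(root, get_subcats, max_depth=None):
    """Return the set of categories reachable from root (root included).

    get_subcats is a hypothetical callable that maps a category title to
    its direct subcategory titles, e.g. a wrapper around Query 1.
    """
    seen = {root}                      # categories already discovered
    queue = deque([(root, 0)])         # (category, depth below root)
    while queue:
        category, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue                   # do not expand below max_depth
        for child in get_subcats(category):
            if child not in seen:      # skipping seen nodes breaks loops
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


# Made-up toy graph (not real Wikipedia data) that contains a loop.
toy_graph = {
    "Medicine": ["Medical_specialties", "Health_care"],
    "Medical_specialties": ["Cardiology"],
    "Health_care": ["Medicine"],       # points back to the root
    "Cardiology": [],
}

result = collect_subcategories("Medicine", lambda c: toy_graph.get(c, []))
print(sorted(result))
```

Passing max_depth=1 limits the walk to the first-order subcategories, which corresponds to what Query 1 returns.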

Query 2:
#Get all pages from category Medicine
SELECT 
page.page_title,
page.page_namespace
FROM 
page 
INNER JOIN 
categorylinks 
ON 
page.page_id = categorylinks.cl_from 
WHERE 
cl_to IN ('Medicine') 
AND 
cl_type = 'page';

This query returns all the page titles and their namespaces for pages that belong to the 'Medicine' category. This mirrors the second table from https://en.wikipedia.org/wiki/Category:Medicine


Query 3:
#Get all articles from the first-order subcategories of category Medicine
SELECT 
page.page_title,
page.page_namespace
FROM 
page 
INNER JOIN 
categorylinks 
ON 
page.page_id = categorylinks.cl_from 
WHERE 
cl_to IN (
    SELECT 
    page.page_title 
    FROM 
    page 
    INNER JOIN 
    categorylinks 
    ON 
    page.page_id = categorylinks.cl_from 
    WHERE 
    cl_to = 'Medicine' 
    AND 
    cl_type = 'subcat'
) 
AND 
cl_type = 'page'
AND page.page_namespace = 0;

The final query basically combines query 1 and query 2 and gets a list of article titles (namespace 0) that belong to one of the first-order subcategories of 'Medicine'. As written, it does not include articles that sit only in 'Medicine' itself; query 2 covers those.
I have attached a CSV file with the results of that query; it contains 2809 article titles. I am sure the queries do not deal with all the edge cases and can be refined, but my goal was to illustrate how to tackle your problem using existing tools that are available to everyone.


4) Finally, you would have to run a simple script to retrieve the pageview numbers for each of the 2809 article titles from stats.grok.se (this you will have to do yourself, but a combination of bash, wget and qs should do the trick, or you can write a Python / PHP / Ruby script that does this for you).
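
As a rough starting point for step 4, here is a minimal Python sketch. It rests on assumptions that you should verify against the site first: that stats.grok.se serves JSON at a URL of the form /json/<project>/<YYYYMM>/<title>, and that the response contains a "daily_views" object mapping dates to counts. The function names are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed base URL and path layout; check against the live site.
BASE_URL = "http://stats.grok.se/json"


def build_url(title, month, project="en"):
    """Build the (assumed) JSON URL for one article and one month (YYYYMM)."""
    quoted = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"{BASE_URL}/{project}/{month}/{quoted}"


def total_views(payload):
    """Sum the per-day counts in an (assumed) {"daily_views": {...}} payload."""
    return sum(payload.get("daily_views", {}).values())


def fetch_total(title, month, project="en"):
    """Download one article's JSON and return its monthly total."""
    with urllib.request.urlopen(build_url(title, month, project), timeout=30) as resp:
        return total_views(json.load(resp))


def fetch_all(titles, month, delay=1.0):
    """Fetch totals for many titles, pausing between requests to be polite."""
    totals = {}
    for title in titles:
        totals[title] = fetch_total(title, month)
        time.sleep(delay)
    return totals
```

You would read the titles from the attached CSV file and call something like fetch_all(titles, "201309"). The pause between requests is there to avoid hammering a volunteer-run service.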

That's all ;) 


We do want to make this feature part of a more general-purpose pageview API, for which we are discussing the contours on this list. Please chime in with your use cases!

I hope this will help you to get the data before October 11th. 



Best,

Diederik





On Thu, Oct 3, 2013 at 11:52 AM, Lane Rasberry <lane@bluerasberry.com> wrote:
Hello Wikipedia data enthusiasts!

My name is Lane Rasberry, user:bluerasberry, and I contribute to health content on English Wikipedia. I am writing to ask for help from WMF people and community allies in drafting and backing with evidence a statement for publication in a medical journal. The statement that I would like to make is something like this:

"The amount of traffic received by health articles on Wikipedia makes Wikipedia a significant source of health information."

When I make this statement, I would like to be able to do so as clearly as possible and in a way that is backed by authentication by the Wikimedia Foundation and probably a bit of data, perhaps in the form of a comparison with traffic to another health website. I happen to work for the US-based non-profit organization Consumer Reports, and we have thought about comparing Wikipedia's traffic with WebMD's traffic, as WebMD is sometimes reported as being the most popular source of health information online or in the world. At Consumer Reports we get traffic data from Nielsen, so that would be the source for comparison data.
<https://en.wikipedia.org/wiki/Nielsen_Holdings>

I need help from other stakeholders from this because if this article is published - and this is not unlikely because it was requested of me - then it could be cited by other people doing outreach as supporting evidence of the impact and worthiness of developing Wikimedia content related to health. Even if it is not published in this instance the increasing media attention which Wikipedia health content is getting merits having some verified statement to share about traffic.

I wrote more about why I need this statement and how it can be reused at
<https://meta.wikimedia.org/wiki/Wiki_Project_Med/traffic>

I am writing some individuals in addition to sending this to mailing lists for the following reasons:
  • Dario and Jonathan Morgan, you both are Wikimedia data people and I have talked with you both about this directly
  • Erik Zachte, I talked with you about this generally in Feb 2013
  • Doc James, we both say that Wikipedia health content is popular but neither of us do this with authenticated data
  • Jake Orlowitz, you are managing Wikipedia's relationship with the Cochrane Collaboration and they also are partnering with the Wikipedia health community on the premise that traffic matters
  • AnthonyhCole, you were asking me for my opinion about what I think the WMF could do to support people doing Wikipedia outreach in health. I think lots of people would find a statement about traffic to health articles useful.
  • Matthew Roth, you manage communications at the Wikimedia Foundation and if you want any input into what I am doing then I would love to have your advice.
I need this soon - perhaps by October 11? Is that possible? How much work would it be to make this statement? Can someone with the WMF Analytics Team and WMF communications help me? Am I in the right forums?

Thanks,

--
Lane Rasberry
user:bluerasberry on Wikipedia

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics