Hi Lane,
Did you see these reports?
Here is a category tree below category 'Health' on English Wikipedia (with some out-of-context sub branches blacklisted).
http://stats.wikimedia.org/wikimedia/pageviews/categorized/wp-en/2013-07/categories_wp-en_cat_Health_2013-07.html
Here are the page views for articles in all those categories:
Warning: the list is overly complete by design.
Some top ranking titles in this list may seem out of place.
Please note that any Wikipedia article can have tens of categories assigned to it.
A popular article will rank high in any list where it's featured, regardless of the category under review.
Thus a well-known singer may be top ranking in a list about politicians, because he/she also played a minor or brief role in politics.
Iterative pruning of the category tree will yield better results. Now you have to do final filtering yourself.
http://stats.wikimedia.org/wikimedia/pageviews/categorized/wp-en/2013-07/pageviews_wp-en_cat_Health_2013-07.html
New insight:
Instead of using the category hierarchy, article lists from WikiProjects would yield cleaner results, and would suffice for many purposes, notably yours :-)
Cheers,
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Lane Rasberry
Sent: Friday, October 04, 2013 4:39 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Cc: Jake Orlowitz; Anthony Cole; James Heilman; Wiki Medicine discussion; Matthew Roth
Subject: Re: [Analytics] need traffic data for health content...
Thanks Diederik and Dan,
We at WikiProject Medicine already have some simple data based on grok.se reports. Unless the system you guys are sharing reports far less traffic than grok.se shows, I already have an idea of what I want to say. If I followed your instructions, got some numbers, and drew a conclusion, would the analytics team and the Wikimedia Foundation be comfortable with whatever I said and validate my assertions? What if I said -
"Judging by numbers of page views, Wikipedia is one of the most popular sources of health information on the Internet. Based on the available data, and assuming that page views indicate a choice in finding health information and that popularity is defined by what people choose most, it would not be unreasonable to suppose that Wikipedia has become the single most popular source of health information in the world."
Is this strong enough? Is it too strong? Publishing a statement like this might be an invitation to the world to see if anyone can refute this. I am less interested in getting any particular kind of data than I am in identifying a statement that everyone in the Wikimedia movement can back. This is not going to be a technical article and it is preferable that I present no data at all - this article will be a feature in the magazine intended to present Wikipedia and start conversation. If anyone does have questions then I would like to anticipate them in advance and be able to prepare solid supporting evidence, and it would be nice to present data in response to requests for clarification. It might be that people with questions do not come to me at all - they may well contact the WMF and ask you guys to explain whatever I did. It does seem that the Wikimedia Foundation is competing with and perhaps out-performing organizations including NIH, the CDC, and the WHO in their own field of public health education outreach, so some people may have questions about what is happening here.
Since I feel like a simple statement like this has potential to influence the movement brand, I wanted to check in with you all. How do you guys feel about the statement I have written above? Would you be comfortable backing it? What would you want to see from me if I were to earn your support for making a statement like this?
If anyone has other interpretations of the available data or alternative proposals for some fundamental statement about the relative traffic to Wikipedia health content, then please share.
thanks,
On Fri, Oct 4, 2013 at 10:07 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Hi Lane. I am working with Diederik on the pageviews API he mentioned, and your use case is one I care about a lot. Please please let us know what progress you've been making with the approach that Diederik mentioned, and if you get stuck people on this list can help. October 11th is my birthday and I don't like making people sad on my birthday.
Some things about the data stats.grok.se uses:
1. It does not include pageviews for the mobile versions of our projects, and this is pretty significant (like 20% if you want a rough and wildly inaccurate single number). You can take a look here to compare pageviews for mobile / non-mobile versions of our projects: http://reportcard.wmflabs.org/graphs/pageviews_mobile, http://reportcard.wmflabs.org/graphs/pageviews. This past month, enwiki got 1.81B for mobile versions and 10.81B for non-mobile.
2. It does include redirects/renames as two pageviews (slight over-reporting)
3. It has a newly found 2% to 6% under-reporting problem: https://bugzilla.wikimedia.org/show_bug.cgi?id=54504
As we build our pageviews API we are going to solve all these problems and more, but not for our first release.
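As a rough worked example of the first point, the enwiki figures quoted above put the mobile share of total pageviews at roughly 14%. The sketch below only redoes that arithmetic; the function names are made up for illustration, and scaling a single article's count by the site-wide share is an assumption, not something the data guarantees.

```python
# Figures quoted above for enwiki in the past month (billions of pageviews).
MOBILE_VIEWS_B = 1.81
NON_MOBILE_VIEWS_B = 10.81


def mobile_share(mobile, non_mobile):
    """Fraction of all pageviews that came from the mobile site."""
    return mobile / (mobile + non_mobile)


def with_mobile(non_mobile_count, share):
    """Scale a non-mobile count up to an estimated total.

    Assumes the article's mobile share equals the site-wide share,
    which is only a rough approximation.
    """
    return non_mobile_count / (1.0 - share)


share = mobile_share(MOBILE_VIEWS_B, NON_MOBILE_VIEWS_B)
print(f"mobile share of total: {share:.1%}")
print(f"100,000 non-mobile views -> about {with_mobile(100_000, share):,.0f} total")
```

The same correction could be applied to a summed stats.grok.se total, keeping in mind that it is an estimate layered on an estimate.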
On Friday, October 4, 2013, Diederik van Liere wrote:
Hi,
Thanks for reaching out to us, you are definitely asking the questions in the right column. As Dario mentioned, we are working on a pageview API which will eventually support querying pageview counts for all pages belonging to category Foo. That won't be operational before October 11, but I do want to see if we can help you nonetheless.
My advice would be the following:
1) Create an account on https://wikitech.wikimedia.org/wiki/Main_Page
2) Request access (https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request) to be part of the tools-lab project -- this project offers access to redacted mirrors of the actual production mysql databases.
3) Run the following queries (I took Medicine as an example category)
Query 1:
#Get all subcategories for category Medicine
SELECT
page.page_title
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE
cl_to IN ('Medicine')
AND
cl_type = 'subcat';
This will give you the result as can be found on https://en.wikipedia.org/wiki/Category:Medicine
It gives a list of all the first-order child categories of the root category 'Medicine'. Now obviously you could traverse further down and get sub-sub categories etc but this is merely to illustrate a minimum approach. (Constructing a category graph is not entirely trivial as you have to consider potential loops).
Query 2:
#Get all pages from category Medicine
SELECT
page.page_title,
page.page_namespace
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE
cl_to IN ('Medicine')
AND
cl_type = 'page';
This query returns all the article titles and their namespace for pages that belong to the 'Medicine' category. This mirrors the second table from https://en.wikipedia.org/wiki/Category:Medicine
Query 3:
#Get all pages from category Medicine and its subcategories
SELECT
page.page_title,
page.page_namespace
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE cl_to IN (
  SELECT page.page_title
  FROM page
  INNER JOIN categorylinks ON page.page_id = categorylinks.cl_from
  WHERE cl_to = 'Medicine' AND cl_type = 'subcat'
)
AND
cl_type = 'page'
AND page.page_namespace = 0;
The final query basically combines query 1 and query 2 and gets a list of article titles that belong either to the category 'Medicine' or one of its subcategories.
I have attached a csv file with the results of that query. It contains 2809 article titles. I am sure the queries are not dealing with all the edge cases, can be refined etc. but my goal was to illustrate how to tackle your problem using existing tools that are available for all.
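Going deeper than the first level of subcategories means walking the category graph while guarding against loops. A minimal sketch of that walk, assuming a hypothetical `get_subcats(name)` callable that returns the direct subcategories of a category (for example by running Query 1 with a different category name); the function name and the `max_depth` cut-off are illustrative choices, not part of any existing tool:

```python
from collections import deque


def collect_subcategories(root, get_subcats, max_depth=3):
    """Breadth-first walk of the category tree below `root`.

    `get_subcats(name)` is a hypothetical callable returning the direct
    subcategory titles of `name`. The `seen` set ensures each category
    is visited once, so loops in the graph cannot cause endless walking.
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand categories at the depth limit
        for child in get_subcats(name):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


# Toy graph with a deliberate loop (Medicine -> Health -> Medicine).
toy = {
    "Medicine": ["Diseases", "Health"],
    "Health": ["Medicine", "Nutrition"],
    "Diseases": [],
    "Nutrition": [],
}
cats = collect_subcategories("Medicine", lambda c: toy.get(c, []))
print(sorted(cats))
```

The depth limit matters in practice because deep branches of the category tree tend to drift away from the root topic.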
4) Finally, you would have to run a simple script and retrieve the pageview numbers for each of the 2809 article titles from stats.grok.se (this you will have to do yourself, but a combination of bash, wget and qs should do the trick, or you could write a python / php / ruby script that does this for you).
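One possible shape for such a script in Python is sketched below. The URL pattern and the "daily_views" field are assumptions about the stats.grok.se JSON interface and should be checked against the live service before use; the function names and the `fetch` parameter are made up for illustration. The demo at the end uses a fake fetcher so nothing is downloaded.

```python
import json
import urllib.parse
import urllib.request

# Assumed URL pattern for the stats.grok.se JSON interface; verify before use.
URL_TEMPLATE = "http://stats.grok.se/json/{lang}/{month}/{title}"


def fetch_json(url):
    """Download `url` and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def monthly_views(title, month="201307", lang="en", fetch=fetch_json):
    """Sum the daily counts for one article in one month.

    Assumes the response carries a "daily_views" mapping of date -> count.
    """
    url = URL_TEMPLATE.format(
        lang=lang, month=month, title=urllib.parse.quote(title)
    )
    data = fetch(url)
    return sum(data.get("daily_views", {}).values())


def total_views(titles, **kwargs):
    """Return {title: monthly view count} for every title in `titles`."""
    return {title: monthly_views(title, **kwargs) for title in titles}


# Offline demo: a fake fetcher stands in for the real service.
def fake_fetch(url):
    return {"daily_views": {"2013-07-01": 10, "2013-07-02": 5}}


demo = total_views(["Medicine", "Health_care"], fetch=fake_fetch)
print(demo)
```

For a real run over a few thousand titles it would be polite to pause between requests and to cache results, since each title costs one HTTP request.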
That's all ;)
We do want to make this feature part of a more general purpose pageview api for which we are discussing the contours on this list. Please chime in with your use-cases!
I hope this will help you to get the data before October 11th.
Best,
Diederik
On Thu, Oct 3, 2013 at 11:52 AM, Lane Rasberry lane@bluerasberry.com wrote:
Hello Wikipedia data enthusiasts!
My name is Lane Rasberry, user:bluerasberry, and I contribute to health content on English Wikipedia. I am writing to ask for help from WMF people and community allies in drafting and backing with evidence a statement for publication in a medical journal. The statement that I would like to make is something like this:
"The amount of traffic received by health articles on Wikipedia makes Wikipedia a significant source of health information."
When I make this statement, I would like to be able to do so as clearly as possible and in a way that is backed by authentication by the Wikimedia Foundation and probably a bit of data, perhaps in the form of a comparison with traffic to another health website. I happen to work for the US-based non-profit organization Consumer Reports, and we have thought about comparing Wikipedia's traffic with WebMD's traffic, as WebMD is sometimes reported as being the most popular source of health information online or in the world. At Consumer Reports we get traffic data from Nielsen, so that would be the source for comparison data. https://en.wikipedia.org/wiki/Nielsen_Holdings
I need help from other stakeholders on this because if this article is published - and this is not unlikely because it was requested of me - then it could be cited by other people doing outreach as supporting evidence of the impact and worthiness of developing Wikimedia content related to health. Even if it is not published in this instance, the increasing media attention which Wikipedia health content is getting merits having some verified statement to share about traffic.
I wrote more about why I need this statement and how it can be reused at https://meta.wikimedia.org/wiki/Wiki_Project_Med/traffic
I am writing some individuals in addition to sending this to mailing lists for the following reasons:
- Dario and Jonathan Morgan, you both are Wikimedia data people and I have talked with you both about this directly
- Erik Zachte, I talked with you about this generally http://bluerasberry.com/2013/02/the-metric-i-want-from-the-wikimedia-analytics-team/ in Feb 2013
- Doc James, we both say that Wikipedia health content is popular but neither of us do this with authenticated data
- Jake Orlowitz, you are managing Wikipedia's relationship with the Cochrane Collaboration and they also are partnering with the Wikipedia health community on the premise that traffic matters
- AnthonyhCole, you were asking me for my opinion about what I think the WMF could do to support people doing Wikipedia outreach in health. I think lots of people would find a statement about traffic to health articles useful.
- Matthew Roth, you manage communications at the Wikimedia Foundation and if you want any input into what I am doing then I would love to have your advice.
I need this soon - perhaps by October 11? Is that possible? How much work would it be to make this statement? Can someone with the WMF Analytics Team and WMF communications help me? Am I in the right forums?
Thanks,