Hi Lane,

 

Did you see these reports?

 

Here is a category tree below category 'Health' on English Wikipedia (with some out-of-context sub branches blacklisted).

http://stats.wikimedia.org/wikimedia/pageviews/categorized/wp-en/2013-07/categories_wp-en_cat_Health_2013-07.html

 

Here are the page views for articles in all those categories:

Warning: the list is overly complete by design.

Some top ranking titles in this list may seem out of place.

Please note that any Wikipedia article can have tens of categories assigned to it.

A popular article will rank high in any list where it's featured, regardless of the category under review.

Thus a well-known singer may be top ranking in a list about politicians, because he/she also played a minor or brief role in politics.

Iterative pruning of the category tree will yield better results. For now, you will have to do the final filtering yourself.

 

http://stats.wikimedia.org/wikimedia/pageviews/categorized/wp-en/2013-07/pageviews_wp-en_cat_Health_2013-07.html

 

 

New insight:

Instead of using the category hierarchy, article lists from WikiProjects would yield cleaner results, and would suffice for many purposes, notably yours :-)

 

Cheers,

Erik

 

From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Lane Rasberry
Sent: Friday, October 04, 2013 4:39 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Cc: Jake Orlowitz; Anthony Cole; James Heilman; Wiki Medicine discussion; Matthew Roth
Subject: Re: [Analytics] need traffic data for health content...

 

Thanks Diederik and Dan,

We at WikiProject Medicine already have some simple data based on grok.se reports. Unless the system you guys are sharing reports far less traffic than grok.se shows, I already have an idea of what I want to say. If I followed your instructions, got some numbers, and drew a conclusion, would the analytics team and the Wikimedia Foundation be comfortable with whatever I said and validate my assertions? What if I said -

 

"Judging by numbers of page views, Wikipedia is one of the most popular sources of health information on the Internet. Based on the available data, and assuming that page views indicate a choice in finding health information and that popularity is defined by what people choose most, it would not be unreasonable to suppose that Wikipedia has become the single most popular source of health information in the world."

Is this strong enough? Is it too strong? Publishing a statement like this might be an invitation to the world to see if anyone can refute this. I am less interested in getting any particular kind of data than I am in identifying a statement that everyone in the Wikimedia movement can back. This is not going to be a technical article and it is preferable that I present no data at all - this article will be a feature in the magazine intended to present Wikipedia and start conversation. If anyone does have questions then I would like to anticipate them in advance and be able to prepare solid supporting evidence, and it would be nice to present data in response to requests for clarification. It might be that people with questions do not come to me at all - they may well contact the WMF and ask you guys to explain whatever I did. It does seem that the Wikimedia Foundation is competing with and perhaps out-performing organizations including NIH, the CDC, and the WHO in their own field of public health education outreach, so some people may have questions about what is happening here.

Since I feel like a simple statement like this has potential to influence the movement brand, I wanted to check in with you all. How do you guys feel about the statement I have written above? Would you be comfortable backing it? What would you want to see from me if I were to earn your support for making a statement like this?

If anyone has other interpretations of the available data or alternative proposals for some fundamental statement about the relative traffic to Wikipedia health content, then please share.

 

thanks,

 

 

On Fri, Oct 4, 2013 at 10:07 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:

Hi Lane.  I am working with Diederik on the pageviews API he mentioned, and your use case is one I care about a lot.  Please please let us know what progress you've been making with the approach that Diederik mentioned, and if you get stuck people on this list can help.  October 11th is my birthday and I don't like making people sad on my birthday.

 

Some things about the data stats.grok.se uses:

 

1. It does not include pageviews for the mobile versions of our projects, and this is pretty significant (like 20% if you want a rough and wildly inaccurate single number).  You can take a look here to compare pageviews for mobile / non-mobile versions of our projects: http://reportcard.wmflabs.org/graphs/pageviews_mobile and http://reportcard.wmflabs.org/graphs/pageviews.  This past month, enwiki got 1.81B for mobile versions and 10.81B for non-mobile.

 

2. It does include redirects/renames as two pageviews (slight over-reporting)

 

3. It has a newly found 2% to 6% under-reporting problem: https://bugzilla.wikimedia.org/show_bug.cgi?id=54504

 

As we build our pageviews API we are going to solve all these problems and more, but not for our first release.
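
As a rough illustration of how the first and third caveats combine, the arithmetic can be sketched in Python. The mobile and non-mobile totals are the enwiki figures quoted above; the function names and the example count of 1,000,000 are hypothetical. The sketch assumes that "under-reporting by 2% to 6%" means the reported count is 94% to 98% of the true count, and that an article's mobile share matches enwiki as a whole, neither of which is confirmed here. It ignores the redirect over-count because no figure is given for it.

```python
# Rough correction of a stats.grok.se count, using only the figures quoted above.
MOBILE_VIEWS = 1.81e9       # enwiki mobile pageviews, past month
NON_MOBILE_VIEWS = 10.81e9  # enwiki non-mobile pageviews, past month


def mobile_share(mobile=MOBILE_VIEWS, non_mobile=NON_MOBILE_VIEWS):
    """Fraction of all pageviews that came from the mobile site."""
    return mobile / (mobile + non_mobile)


def corrected_range(grok_count, under_low=0.02, under_high=0.06):
    """Return (low, high) estimates of total views for a grok.se count.

    Assumes the count is non-mobile only and under-reported by 2% to 6%,
    and that the article's mobile share matches enwiki as a whole.
    """
    share = mobile_share()
    estimates = []
    for under in (under_low, under_high):
        non_mobile = grok_count / (1.0 - under)  # undo the under-reporting
        total = non_mobile / (1.0 - share)       # add the mobile views back
        estimates.append(total)
    return tuple(estimates)


if __name__ == "__main__":
    low, high = corrected_range(1_000_000)  # hypothetical article count
    print(f"mobile share: {mobile_share():.1%}")
    print(f"estimated total views: {low:,.0f} to {high:,.0f}")
```

With these figures the mobile share works out to roughly 14% of all enwiki pageviews, which is in the same ballpark as the rough 20% mentioned above.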

 


On Friday, October 4, 2013, Diederik van Liere wrote:

Hi,

 

Thanks for reaching out to us; you are definitely asking the questions in the right place. As Dario mentioned, we are working on a pageview API which will eventually support querying pageview counts for all pages belonging to category Foo. That won't be operational before October 11, but I do want to see if we can help you nonetheless.

 

My advice would be the following:

 

2) Request access (https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request)  to be part of the tools-lab project -- this project offers access to redacted mirrors of the actual production mysql databases.

3) Run the following queries (I took Medicine as an example category)

 

Query 1:

# Get all subcategories for category Medicine
SELECT page.page_title
FROM page
INNER JOIN categorylinks
  ON page.page_id = categorylinks.cl_from
WHERE cl_to IN ('Medicine')
  AND cl_type = 'subcat';

 

 

This will give you the same result as can be found at https://en.wikipedia.org/wiki/Category:Medicine

It gives a list of all the first-order child categories of the root category 'Medicine'. You could traverse further down and get sub-subcategories and so on, but this is merely to illustrate a minimal approach. (Constructing a category graph is not entirely trivial, as you have to consider potential loops.)
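
The loop problem mentioned above can be handled by remembering which categories have already been visited. The following is a minimal Python sketch, not the approach used for the attached results: `get_subcategories` is a hypothetical callable standing in for Query 1 (given a category title, return its direct subcategory titles), and the toy graph exists only to show a cycle being handled.

```python
from collections import deque


def collect_categories(root, get_subcategories, max_depth=None):
    """Breadth-first walk of the category graph starting at `root`.

    `get_subcategories` is a hypothetical callable: title -> iterable of
    direct subcategory titles (for example, a wrapper around Query 1).
    The `seen` set guarantees each category is expanded at most once, so
    loops in the graph cannot cause endless traversal.
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand below the requested depth
        for child in get_subcategories(category):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


if __name__ == "__main__":
    # Toy graph with a loop: Medicine -> Health -> Medicine.
    toy_graph = {
        "Medicine": ["Health", "Medical_specialties"],
        "Health": ["Medicine", "Nutrition"],
        "Medical_specialties": [],
        "Nutrition": [],
    }
    result = collect_categories("Medicine", lambda c: toy_graph.get(c, []))
    print(sorted(result))
```

Passing `max_depth=1` limits the walk to the root and its first-order children, which corresponds to what Query 1 returns.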

 

Query 2:

# Get all pages from category Medicine
SELECT page.page_title, page.page_namespace
FROM page
INNER JOIN categorylinks
  ON page.page_id = categorylinks.cl_from
WHERE cl_to IN ('Medicine')
  AND cl_type = 'page';

 

This query returns all the article titles and their namespaces for pages that belong to the 'Medicine' category. This mirrors the second table from https://en.wikipedia.org/wiki/Category:Medicine

 

 

Query 3:

# Get all pages from category Medicine and its subcategories
SELECT page.page_title, page.page_namespace
FROM page
INNER JOIN categorylinks
  ON page.page_id = categorylinks.cl_from
WHERE cl_to IN (
    SELECT page.page_title
    FROM page
    INNER JOIN categorylinks
      ON page.page_id = categorylinks.cl_from
    WHERE cl_to = 'Medicine'
      AND cl_type = 'subcat'
  )
  AND cl_type = 'page'
  AND page.page_namespace = 0;

 

The final query basically combines query 1 and query 2 and gets a list of article titles that belong either to the category 'Medicine' or one of its subcategories.

I have attached a CSV file with the results of that query. It contains 2809 article titles. I am sure the queries do not deal with all the edge cases and can be refined, but my goal was to illustrate how to tackle your problem using existing tools that are available to all.

 

 

4) Finally, you would have to run a simple script to retrieve the pageview numbers for each of the 2809 article titles from stats.grok.se (this you will have to do yourself, but a combination of bash, wget and qs should do the trick, or you can write a python / php / ruby script that does this for you).
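
As one possible shape for such a script, here is a minimal Python sketch. It assumes that stats.grok.se serves monthly data as JSON at `http://stats.grok.se/json/<lang>/<YYYYMM>/<title>` with a `daily_views` mapping of dates to counts; please verify that URL pattern and field name before relying on them. The function names, the one-second pause, and the `fetch` parameter are illustrative choices.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed URL pattern for the stats.grok.se JSON interface (verify before use).
BASE_URL = "http://stats.grok.se/json/{lang}/{month}/{title}"


def build_url(title, month, lang="en"):
    """Build the request URL for one article title and month (YYYYMM)."""
    # Titles from the page table use underscores; percent-encode the rest.
    quoted = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return BASE_URL.format(lang=lang, month=month, title=quoted)


def monthly_total(payload):
    """Sum the per-day counts in one decoded JSON response.

    Assumes the response has a 'daily_views' dict mapping dates to counts.
    """
    return sum(payload.get("daily_views", {}).values())


def http_fetch(url):
    """Download and decode one JSON document."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def total_pageviews(titles, month, fetch=http_fetch, pause=1.0):
    """Return {title: views} for every title, pausing between requests."""
    totals = {}
    for title in titles:
        totals[title] = monthly_total(fetch(build_url(title, month)))
        time.sleep(pause)  # be polite to the server
    return totals
```

Reading the titles from the CSV file and calling `total_pageviews(titles, "201309")` (the month here is only an example) would give one number per article; summing the values gives a total for the whole list.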

 

That's all ;) 

 

 

We do want to make this feature part of a more general purpose pageview api for which we are discussing the contours on this list. Please chime in with your use-cases!

 

I hope this will help you to get the data before October 11th. 

 

 

 

Best,

 

Diederik

 

 

 

 

On Thu, Oct 3, 2013 at 11:52 AM, Lane Rasberry <lane@bluerasberry.com> wrote:

Hello Wikipedia data enthusiasts!

 

My name is Lane Rasberry, user:bluerasberry, and I contribute to health content on English Wikipedia. I am writing to ask for help from WMF people and community allies in drafting and backing with evidence a statement for publication in a medical journal. The statement that I would like to make is something like this:

"The amount of traffic received by health articles on Wikipedia makes Wikipedia a significant source of health information."

When I make this statement, I would like to be able to do so as clearly as possible and in a way that is backed by authentication by the Wikimedia Foundation and probably a bit of data, perhaps in the form of a comparison with traffic to another health website. I happen to work for the US-based non-profit organization Consumer Reports, and we have thought about comparing Wikipedia's traffic with WebMD's traffic, as WebMD is sometimes reported as being the most popular source of health information online or in the world. At Consumer Reports we get traffic data from Nielsen, so that would be the source for comparison data.
<https://en.wikipedia.org/wiki/Nielsen_Holdings>


I need help from other stakeholders with this because if this article is published - and this is not unlikely because it was requested of me - then it could be cited by other people doing outreach as supporting evidence of the impact and worthiness of developing Wikimedia content related to health. Even if it is not published in this instance, the increasing media attention which Wikipedia health content is getting merits having some verified statement to share about traffic.

I wrote more about why I need this statement and how it can be reused at
<https://meta.wikimedia.org/wiki/Wiki_Project_Med/traffic>

I am writing some individuals in addition to sending this to mailing lists for the following reasons:

·         Dario and Jonathan Morgan, you both are Wikimedia data people and I have talked with you both about this directly

·         Erik Zachte, I talked with you about this generally in Feb 2013

·         Doc James, we both say that Wikipedia health content is popular but neither of us does this with authenticated data

·         Jake Orlowitz, you are managing Wikipedia's relationship with the Cochrane Collaboration and they also are partnering with the Wikipedia health community on the premise that traffic matters

·         AnthonyhCole, you were asking me for my opinion about what I think the WMF could do to support people doing Wikipedia outreach in health. I think lots of people would find a statement about traffic to health articles useful.

·         Matthew Roth, you manage communications at the Wikimedia Foundation and if you want any input into what I am doing then I would love to have your advice.

I need this soon - perhaps by October 11? Is that possible? How much work would it be to make this statement? Can someone with the WMF Analytics Team and WMF communications help me? Am I in the right forums?

Thanks,

 

--
Lane Rasberry

user:bluerasberry on Wikipedia


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

 






--

Lane Rasberry

user:bluerasberry on Wikipedia

206.801.0814
lane@bluerasberry.com