Hi,
Thanks for reaching out to us; you are definitely asking the questions in
the right place. As Dario mentioned, we are working on a pageview API which
will eventually support querying pageview counts for all pages
belonging to category Foo. That won't be operational before October 11,
but I do want to see if we can help you nonetheless.
My advice would be the following:
1) Create an account on
https://wikitech.wikimedia.org/wiki/Main_Page
2) Request access (
https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request)
to be part of the tools-lab project -- this project offers access to
redacted mirrors of the actual production MySQL databases.
3) Run the following queries (I took Medicine as an example category)
Query 1:
#Get all subcategories for category Medicine
SELECT
page.page_title
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE
cl_to IN ('Medicine')
AND
cl_type = 'subcat';
This will give you the result as can be found on
https://en.wikipedia.org/wiki/Category:Medicine
It gives a list of all the first-order child categories of the root
category 'Medicine'. Now obviously you could traverse further down and get
sub-subcategories etc., but this is merely to illustrate a minimal approach.
(Constructing a category graph is not entirely trivial as you have to
consider potential loops).
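To make the loop problem concrete, here is a rough Python sketch of a traversal that visits each category only once. It is only an illustration: the function names, the get_subcats callback and the toy data are made up for this example, and in practice the callback would run Query 1 for the given category.

```python
from collections import deque


def collect_subcategories(root, get_subcats):
    """Return the set of all categories reachable from root (root included).

    get_subcats is a hypothetical callback that returns the direct
    subcategories of one category, e.g. by running Query 1 for it.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in get_subcats(current):
            # A category that was already visited is skipped, so a loop
            # in the category graph cannot keep the traversal running.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Made-up toy data with a deliberate loop (Health_care -> Medicine).
TOY_GRAPH = {
    "Medicine": ["Medical_specialties", "Health_care"],
    "Medical_specialties": ["Cardiology"],
    "Health_care": ["Medicine"],
    "Cardiology": [],
}


def toy_lookup(category):
    """Stand-in for a database lookup of direct subcategories."""
    return TOY_GRAPH.get(category, [])


if __name__ == "__main__":
    for name in sorted(collect_subcategories("Medicine", toy_lookup)):
        print(name)
```

You would probably also want to cap the depth, since the deeper levels of the category tree tend to drift away from the topic.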
Query 2:
#Get all pages from category Medicine
SELECT
page.page_title,
page.page_namespace
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE
cl_to IN ('Medicine')
AND
cl_type = 'page';
This query returns all the article titles and their namespace for pages
that belong to the 'Medicine' category. This mirrors the second table from
https://en.wikipedia.org/wiki/Category:Medicine
Query 3:
#Get all pages from category Medicine and its subcategories
SELECT
page.page_title,
page.page_namespace
FROM
page
INNER JOIN
categorylinks
ON
page.page_id = categorylinks.cl_from
WHERE cl_to IN (SELECT page.page_title FROM page INNER JOIN categorylinks
ON page.page_id = categorylinks.cl_from WHERE cl_to = 'Medicine' AND
cl_type = 'subcat')
AND
cl_type = 'page'
AND page.page_namespace = 0;
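The final query basically combines query 1 and query 2 and gets a list of
article titles (namespace 0) that belong to one of the first-level
subcategories of 'Medicine'. Strictly speaking, as written it does not
return articles filed directly under 'Medicine' itself; those come from
query 2.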
I have attached a csv file with the results of that query. It contains 2809
article titles. I am sure the queries do not deal with all the edge
cases and can be refined, but my goal was to illustrate how to tackle your
problem using existing tools that are available to all.
4) Finally, you would have to run a simple script and retrieve the pageview
numbers for each of the 2809 article titles from stats.grok.se (this you will
have to do yourself, but a combination of bash, wget and qs should do the
trick, or you could write a python / php / ruby script that does this for you).
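A python version of such a script could look roughly like the sketch below. Please treat the details as assumptions to verify: I am assuming that stats.grok.se serves monthly JSON at a URL of the form http://stats.grok.se/json/en/YYYYMM/Title and that the response holds a "daily_views" mapping of date to count. The file name medicine_titles.csv and all function names are made up for illustration.

```python
import csv
import json
import time
import urllib.parse
import urllib.request

# Assumed URL layout of the stats.grok.se JSON interface; check before use.
BASE_URL = "http://stats.grok.se/json/en"


def build_url(title, month):
    """Build the request URL for one article title and one month (YYYYMM)."""
    return "{}/{}/{}".format(BASE_URL, month, urllib.parse.quote(title, safe=""))


def total_views(payload):
    """Sum the per-day counts, assuming a 'daily_views' date -> count mapping."""
    return sum(payload.get("daily_views", {}).values())


def fetch_month(title, month):
    """Download and decode the JSON document for one title and month."""
    with urllib.request.urlopen(build_url(title, month)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical input file: one article title per row, first column.
    with open("medicine_titles.csv", newline="", encoding="utf-8") as handle:
        titles = [row[0] for row in csv.reader(handle) if row]
    for title in titles:
        print(title, total_views(fetch_month(title, "201309")))
        time.sleep(1)  # be gentle with a volunteer-run service
```

Summing the printed numbers gives one monthly total for the whole list of titles; repeat with other month values if you need a longer period.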
That's all ;)
We do want to make this feature part of a more general purpose pageview api
for which we are discussing the contours on this list. Please chime in with
your use-cases!
I hope this will help you to get the data before October 11th.
Best,
Diederik
On Thu, Oct 3, 2013 at 11:52 AM, Lane Rasberry <lane(a)bluerasberry.com> wrote:
Hello Wikipedia data enthusiasts!
My name is Lane Rasberry, user:bluerasberry, and I contribute to health
content on English Wikipedia. I am writing to ask for help from WMF
people and community allies in drafting and backing with evidence a
statement for publication in a medical journal. The statement that I would
like to make is something like this:
"The amount of traffic received by health articles on Wikipedia makes
Wikipedia a significant source of health information."
When I make this statement, I would like to be able to do so as clearly as
possible and in a way that is backed by authentication by the Wikimedia
Foundation and probably a bit of data, perhaps in the form of a comparison
with traffic to another health website. I happen to work for the US-based
non-profit organization Consumer Reports, and we have thought about
comparing Wikipedia's traffic with WebMD's traffic, as WebMD is sometimes
reported as being the most popular source of health information online or
in the world. At Consumer Reports we get traffic data from Nielsen, so that
would be the source for comparison data.
<https://en.wikipedia.org/wiki/Nielsen_Holdings>
I need help from other stakeholders from this because if this article is
published - and this is not unlikely because it was requested of me - then
it could be cited by other people doing outreach as supporting evidence of
the impact and worthiness of developing Wikimedia content related to
health. Even if it is not published in this instance the increasing media
attention which Wikipedia health content is getting merits having some
verified statement to share about traffic.
I wrote more about why I need this statement and how it can be reused at
<https://meta.wikimedia.org/wiki/Wiki_Project_Med/traffic>
I am writing some individuals in addition to sending this to mailing lists
for the following reasons:
- Dario and Jonathan Morgan, you both are Wikimedia data people and I
have talked with you both about this directly
- Erik Zachte, I talked with you about this
generally<http://bluerasberry.com/2013/02/the-metric-i-want-from-the-wik…
Feb 2013
- Doc James, we both say that Wikipedia health content is popular but
neither of us do this with authenticated data
- Jake Orlowitz, you are managing Wikipedia's relationship with the
Cochrane Collaboration and they also are partnering with the Wikipedia
health community on the premise that traffic matters
- AnthonyhCole, you were asking me for my opinion about what I think
the WMF could do to support people doing Wikipedia outreach in health. I
think lots of people would find a statement about traffic to health
articles useful.
- Matthew Roth, you manage communications at the Wikimedia Foundation
and if you want any input into what I am doing then I would love to have
your advice.
I need this soon - perhaps by October 11? Is that possible? How much work
would it be to make this statement? Can someone with the WMF Analytics Team
and WMF communications help me? Am I in the right forums?
Thanks,
--
Lane Rasberry
user:bluerasberry on Wikipedia
206.801.0814
lane(a)bluerasberry.com
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics