Hi Lane. I am working with Diederik on the pageviews API he mentioned,
and your use case is one I care about a lot. Please, please let us know
what progress you've been making with the approach that Diederik mentioned,
and if you get stuck, people on this list can help. October 11th is my
birthday and I don't like making people sad on my birthday.
Some things about the data stats.grok.se uses:
1. It does not include pageviews for the mobile versions of our projects,
and this is pretty significant (like 20% if you want a rough and wildly
inaccurate single number). You can compare pageviews for the mobile and
non-mobile versions of our projects here: [link missing]. This past month,
enwiki got 1.81B pageviews for mobile versions and 10.81B for non-mobile.
2. It counts redirects/renames as two pageviews (slight over-reporting).
3. It has a newly found 2% to 6% under-reporting problem: [link missing].
As we build our pageviews API we are going to solve all these problems and
more, but not for our first release.
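As a rough check on item 1, the enwiki figures quoted above can be turned into a mobile share and a scale factor. This is only a sketch: the variable and function names are made up for illustration, and it assumes the site-wide mobile ratio also holds for the articles you care about, which may not be true for health content.

```python
# Figures quoted above for enwiki, in billions of pageviews for one month.
mobile_views = 1.81
non_mobile_views = 10.81

all_views = mobile_views + non_mobile_views
mobile_share = mobile_views / all_views        # fraction of all views that are mobile
scale_factor = all_views / non_mobile_views    # multiply a non-mobile count by this


def adjust_for_mobile(non_mobile_count, factor=scale_factor):
    """Estimate total views from a non-mobile-only count.

    Assumption: the site-wide mobile ratio applies to the pages in question.
    """
    return non_mobile_count * factor


print(f"mobile share of all views: {mobile_share:.1%}")
print(f"scale factor for non-mobile counts: {scale_factor:.3f}")
```

The redirect over-reporting and the 2% to 6% under-reporting are not corrected here, so treat any adjusted number as an estimate.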
On Friday, October 4, 2013, Diederik van Liere wrote:
Hi,
Thanks for reaching out to us, you are definitely asking the questions in
the right place. As Dario mentioned, we are working on a pageview API which
will eventually support querying pageview counts for all pages belonging to
category Foo. That won't be operational before October 11, but I do want to
see if we can help you nonetheless.
My advice would be the following:
1) Create an account on
https://wikitech.wikimedia.org/wiki/Main_Page
2) Request access
(https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request)
to be part of the tools-lab project -- this project offers access to
redacted mirrors of the actual production MySQL databases.
3) Run the following queries (I took Medicine as an example category):
Query 1:
# Get all subcategories for category Medicine
SELECT
    page.page_title
FROM
    page
INNER JOIN
    categorylinks
ON
    page.page_id = categorylinks.cl_from
WHERE
    categorylinks.cl_to IN ('Medicine')
AND
    categorylinks.cl_type = 'subcat';
This will give you the result that can be found at
https://en.wikipedia.org/wiki/Category:Medicine
It gives a list of all the first-order child categories of the root
category 'Medicine'. Obviously you could traverse further down and get
sub-subcategories etc., but this is merely to illustrate a minimal approach.
(Constructing a category graph is not entirely trivial, as you have to
consider potential loops.)
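To illustrate the point about loops, here is a minimal sketch of a breadth-first walk that visits each category only once. The function name and the `children_of` callable are hypothetical; in practice `children_of` would wrap something like Query 1, run once per category.

```python
from collections import deque


def collect_subcategories(root, children_of):
    """Breadth-first walk of a category graph, visiting each category once.

    children_of is a hypothetical callable mapping a category title to an
    iterable of its direct subcategory titles.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        for child in children_of(category):
            if child not in seen:  # already visited: this is what breaks loops
                seen.add(child)
                queue.append(child)
    return seen


# Toy graph containing a loop: Medicine -> Diseases -> Medicine
toy_graph = {
    "Medicine": ["Diseases", "Pharmacology"],
    "Diseases": ["Medicine", "Infectious_diseases"],
    "Pharmacology": [],
    "Infectious_diseases": [],
}
result = collect_subcategories("Medicine", lambda c: toy_graph.get(c, []))
print(sorted(result))
```

Real category trees can drift far from the original topic a few levels down, so a depth limit is probably worth adding on top of this.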
Query 2:
# Get all pages from category Medicine
SELECT
    page.page_title,
    page.page_namespace
FROM
    page
INNER JOIN
    categorylinks
ON
    page.page_id = categorylinks.cl_from
WHERE
    categorylinks.cl_to IN ('Medicine')
AND
    categorylinks.cl_type = 'page';
This query returns all the article titles and their namespaces for pages
that belong to the 'Medicine' category. This mirrors the second table from
https://en.wikipedia.org/wiki/Category:Medicine
Query 3:
# Get all articles from the first-order subcategories of Medicine
# (pages directly in Medicine itself are returned by Query 2)
SELECT
    page.page_title,
    page.page_namespace
FROM
    page
INNER JOIN
    categorylinks
ON
    page.page_id = categorylinks.cl_from
WHERE
    categorylinks.cl_to IN (
        SELECT subcat_page.page_title
        FROM page AS subcat_page
        INNER JOIN categorylinks AS subcat_link
        ON subcat_page.page_id = subcat_link.cl_from
        WHERE subcat_link.cl_to = 'Medicine'
        AND subcat_link.cl_type = 'subcat'
    )
AND
    categorylinks.cl_type = 'page'
AND
    page.page_namespace = 0;
The final query builds on query 1 and query 2 and gets a list of article
titles (namespace 0) that belong to one of the first-order subcategories of
'Medicine'. As written, it does not return pages that sit directly in
'Medicine' itself; those come from query 2.
I have attached a CSV file with the results of that query. It contains
2809 article titles. I am sure the queries do not deal with all the
edge cases and can be refined, but my goal was to illustrate how to tackle
your problem using existing tools that are available to all.
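One possible way to cover both the category and its subcategories in a single statement is sketched below, run against an in-memory SQLite database so the logic can be checked without database access. The two tables are a heavily simplified stand-in for the real `page` and `categorylinks` tables, and the rows are invented test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified toy schema: only the columns the queries above rely on.
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT, page_namespace INTEGER);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, "Medicine", 0),    # article directly in Category:Medicine
    (2, "Diseases", 14),   # category page (namespace 14), subcategory of Medicine
    (3, "Influenza", 0),   # article in Category:Diseases
    (4, "Physics", 0),     # unrelated article
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", [
    (1, "Medicine", "page"),
    (2, "Medicine", "subcat"),
    (3, "Diseases", "page"),
    (4, "Science", "page"),
])

rows = conn.execute("""
SELECT DISTINCT p.page_title
FROM page AS p
INNER JOIN categorylinks AS cl ON p.page_id = cl.cl_from
WHERE cl.cl_type = 'page'
  AND p.page_namespace = 0
  AND (cl.cl_to = 'Medicine'
       OR cl.cl_to IN (SELECT sub.page_title
                       FROM page AS sub
                       INNER JOIN categorylinks AS subcl ON sub.page_id = subcl.cl_from
                       WHERE subcl.cl_to = 'Medicine' AND subcl.cl_type = 'subcat'))
ORDER BY p.page_title
""").fetchall()
titles = [row[0] for row in rows]
print(titles)
```

DISTINCT is there because an article can sit in several of the matched categories at once. The same statement should carry over to the MySQL mirrors, but I have only checked it against this toy data.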
4) Finally, you would have to run a simple script to retrieve the pageview
numbers for each of the 2809 article titles from stats.grok.se (this you will
have to do yourself, but a combination of bash, wget and qs should do the
trick, or you could write a python / php / ruby script that does this for
you).
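A minimal sketch of such a script in Python follows. Treat the details as assumptions to verify: it assumes stats.grok.se serves JSON at a URL of the form /json/<language>/<YYYYMM>/<title> and that the response holds a "daily_views" mapping of date to count. The function names are invented for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed URL pattern for the stats.grok.se JSON interface; verify before use.
URL_TEMPLATE = "http://stats.grok.se/json/{lang}/{month}/{title}"


def build_url(title, month, lang="en"):
    # Percent-encode the title; underscores pass through unchanged.
    return URL_TEMPLATE.format(lang=lang, month=month, title=quote(title, safe=""))


def http_get(url):
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def monthly_views(title, month, fetch=http_get):
    payload = json.loads(fetch(build_url(title, month)))
    # Assumed response shape: {"daily_views": {"2013-09-01": 123, ...}, ...}
    return sum(payload.get("daily_views", {}).values())


def views_for_titles(titles, month, fetch=http_get):
    # One request per title; returns {title: total views in that month}.
    return {title: monthly_views(title, month, fetch) for title in titles}
```

You would read the titles from the CSV and call something like views_for_titles(titles, "201309"). With 2809 titles that is 2809 requests, so adding a short pause between them would be polite. The fetch parameter exists only so the parsing can be tested without touching the network.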
That's all ;)
We do want to make this feature part of a more general-purpose pageview
API, for which we are discussing the contours on this list. Please chime in
with your use cases!
I hope this will help you to get the data before October 11th.
Best,
Diederik
On Thu, Oct 3, 2013 at 11:52 AM, Lane Rasberry <lane(a)bluerasberry.com> wrote:
Hello Wikipedia data enthusiasts!
My name is Lane Rasberry, user:bluerasberry, and I contribute to health
content on English Wikipedia. I am writing to ask for help from WMF
people and community allies in drafting and backing with evidence a
statement for publication in a medical journal. The statement that I would
like to make is something like this:
"The amount of traffic received by health articles on Wikipedia makes
Wikipedia a significant source of health information."
When I make this statement, I would like to be able to do so as clearly
as possible and in a way that is backed by authentication by the Wikimedia
Foundation and probably a bit of data, perhaps in the form of a comparison
with traffic to another health website. I happen to work for the US-based
non-profit organization Consumer Reports, and we have thought about
comparing Wikipedia's traffic with WebMD's traffic, as WebMD is sometimes
reported as being the most popular source of health information online or
in the world. At Consumer Reports we get traffic data from Nielsen
<https://en.wikipedia.org/wiki/Nielsen_Holdings>, so that would be the
source for comparison data.
I need help from other stakeholders on this because if this article is
published - and this is not unlikely because it was requested of me - then
it could be cited by other people doing outreach as supporting evidence of
the impact and worthiness of developing Wikimedia content related to
health. Even if it is not published in this instance, the increasing media
attention which Wikipedia health content is getting merits having some
verified statement to share about traffic.
I wrote more about why I need this statement and how it can be reused at
<https://meta.wikimedia.org/wiki/Wiki_Project_Med/traffic>
I am writing some individuals in addition to sending this to mailing
lists for the following reasons:
- Dario and Jonathan Morgan, you both are Wikimedia data people and
I have talked with you both about this directly
- Erik Zachte, I talked with you about this
generally<http://bluerasberry.com/2013/02/the-metric-i-want-from-the-wik…
Feb 2013
- Doc James, we both say that Wikipedia health content is popular
but neither of us do this with authenticated data
- Jake Orlowitz, you are managing Wikipedia's relationship with the
Cochrane Collaboration and they also are partnering with the Wikipedia
health community on the premise that traffic matters
- AnthonyhCole, you were asking me for my opinion about what I think
the WMF could do to support people doing Wikipedia outreach in health. I
think lots of people would find a statement about traffic to health
articles useful.
- Matthew Roth, you manage communications at the Wikimedia
Foundation and if you want any input into what I am doing then I would love
to have your advice.
I need this soon - perhaps by October 11? Is that possible? How much
work would it be to make this statement? Can someone with the WMF Analytics
Team and WMF communications help me? Am I in the right forums?
Thanks,
--
Lane Rasberry
user:bluerasberry on Wikipedia
206.801.0814
lane(a)bluerasberry.com
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics