Hello everyone,
as part of a GLAM project, we are looking for ways to
(1) identify the number of pages in a given category - including via subcategories - on a given wiki
(2) get the pageview stats for all these pages, including in aggregate
(3) do the above across languages or projects
(4) estimate what outcomes to expect in terms of Wikipedia pageviews and related metrics after an image donation of X files to a given category on Commons.
I assume that part of this is available via the API, but I couldn't find anything close enough.
Any pointers would be appreciated.
Thanks and cheers,
Daniel
On Tue, Jul 9, 2013 at 10:46 AM, Daniel Mietchen <daniel.mietchen@googlemail.com> wrote:
> Hello everyone,
> as part of a GLAM project, we are looking for ways to (1) identify the number of pages in a given category - including via subcategories - on a given wiki
You can get the list of subcategories of a category with list=categorymembers&cmtype=subcat. You'd have to make calls to this for each individual (sub)category you're interested in, and be sure to detect cycles properly.
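A minimal sketch of the traversal described above, with a "seen" set for cycle detection. The endpoint URL, the User-Agent string, and the function names fetch_subcats and walk_categories are placeholders chosen for illustration; the sketch also assumes a current MediaWiki version that returns a "continue" object when more results are available.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

# Assumption: any MediaWiki api.php endpoint can be used here.
API_URL = "https://commons.wikimedia.org/w/api.php"


def fetch_subcats(category, api_url=API_URL):
    """Return the titles of the direct subcategories of `category`."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "subcat",
        "cmlimit": "max",
        "format": "json",
    }
    titles = []
    while True:
        url = api_url + "?" + urllib.parse.urlencode(params)
        # Placeholder User-Agent; replace with one that identifies your tool.
        req = urllib.request.Request(
            url, headers={"User-Agent": "category-walk-sketch/0.1"}
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:
            return titles
        # Feed the continuation values back in to get the next batch.
        params.update(data["continue"])


def walk_categories(root, get_subcats=fetch_subcats):
    """Breadth-first walk of the category tree below `root`.

    Each category is expanded at most once, so cycles in the
    category graph cannot cause an infinite loop.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for sub in get_subcats(current):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return seen
```

walk_categories makes one or more API calls per category, so for large trees it may be worth adding a depth limit or a pause between requests. The get_subcats parameter can be swapped for a local lookup when testing without network access.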
You can get the number of pages in a category with prop=categoryinfo. You can batch this by specifying up to 50 titles per query (500 if your account has the "apihighlimits" userright).
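The batching could look roughly like this. The helper names (chunked, categoryinfo_params, page_counts) are made up for illustration, and the response layout assumed here is the default JSON format, where query.pages is keyed by page ID and each existing category carries a "categoryinfo" object.

```python
def chunked(titles, size=50):
    """Yield successive batches of at most `size` titles (50 is the normal limit)."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def categoryinfo_params(batch):
    """Build the query parameters for one prop=categoryinfo request."""
    return {
        "action": "query",
        "prop": "categoryinfo",
        "titles": "|".join(batch),  # multiple titles are pipe-separated
        "format": "json",
    }


def page_counts(response):
    """Map category title -> page count from a decoded API response.

    Categories without a "categoryinfo" entry (for example missing
    ones) are skipped.
    """
    counts = {}
    for page in response.get("query", {}).get("pages", {}).values():
        info = page.get("categoryinfo")
        if info is not None:
            counts[page["title"]] = info["pages"]
    return counts
```

The API may normalize titles, so the keys returned by page_counts can differ slightly from the titles that were sent.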
If you're going to be doing a lot of this, it might be better to perform queries directly against the database, either by downloading the database dumps or using Tool Labs.
> (2) get the pageview stats for all these pages, including on aggregate
The raw pageview stat data may also be available on Tool Labs. I see some data in /shared/viewstats/, but it doesn't seem to be up to date.
-- Brad Jorsch (Anomie) Software Engineer Wikimedia Foundation
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Note that the cat_pages entries in the category table are sometimes inaccurate (especially for larger categories) and are closer to an order-of-magnitude estimate. If you are going to look at the page views of every entry in the category anyway, you could just count the pages directly.
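Counting directly could be sketched as follows. Here get_pages is a hypothetical callable that returns the member page titles of one category (for instance a wrapper around list=categorymembers with cmtype=page); it is not part of any library. Collecting the titles in a set also avoids counting a page twice when it sits in several subcategories.

```python
def distinct_pages(categories, get_pages):
    """Return the set of distinct page titles found in `categories`.

    `get_pages` is a caller-supplied function: category title -> iterable
    of member page titles.
    """
    pages = set()
    for category in categories:
        pages.update(get_pages(category))
    return pages
```

len(distinct_pages(...)) then gives the page count, and the same set can be reused as the list of titles to look up in the page view data.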
Page view stats are available at http://dumps.wikimedia.org/other/pagecounts-raw/
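A rough sketch of summing views from one of those files, assuming the hourly files are gzip-compressed text with one space-separated record per line in the form "project title views bytes" (for example "en Main_Page 242332 4737756101") and with URL-encoded titles. The function names are illustrative only.

```python
import gzip
from collections import Counter
from urllib.parse import unquote


def normalize(title):
    """Decode percent-escapes and use underscores, as in the dump files."""
    return unquote(title).replace(" ", "_")


def sum_views(lines, project, titles):
    """Sum view counts per title for one project code (e.g. "en").

    `lines` is any iterable of raw records; malformed lines are skipped.
    """
    wanted = {normalize(t) for t in titles}
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != project:
            continue
        title = normalize(parts[1])
        if title in wanted:
            try:
                views[title] += int(parts[2])
            except ValueError:
                continue  # non-numeric count, ignore the record
    return views


def sum_views_in_file(path, project, titles):
    """Apply sum_views to one downloaded, gzip-compressed hourly file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return sum_views(fh, project, titles)
```

sum(views.values()) gives the aggregate for the whole set of pages; covering a longer period means repeating this over every hourly file in the range and adding the counters together.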
--bawolff