Heya, I asked this on IRC but didn't get any replies, so I'm following up this way. I have a question about the newer metrics REST v1 API: is there a way to specify how many top articles to pull from https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_to... or is 1k hardcoded? The old metrics data included the most-viewed pages, but that disappeared with the change to the new API.
The reason I ask is that we (https://endlessos.com) are trying to rebuild our stale encyclopedia apps for offline usage. We are space-limited, so we would like to include only the pages most likely to be viewed, fitting within a size envelope that varies with the device in question (probably up to a 100k-article limit). The new API doesn't give us the tools to figure out the rankings cleanly, other than rate-limiting on our side and checking every single article's metrics endpoint for counts.
So the main question is: is there a way to get this data out of the current API? If not, could the metrics/pageviews/top API be augmented with `skip` and/or `limit` params, like other services that offer this type of filtering?
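For context, this is roughly how we're pulling the current top list today; the project and date below are just example values, and we only ever get 1000 entries back:

```python
import requests

# Example values only; any project/access/date combination supported by the API works.
URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
       "en.wikipedia/all-access/2018/03/01")

resp = requests.get(URL, headers={"User-Agent": "offline-encyclopedia-builder/0.1 (example)"})
resp.raise_for_status()

articles = resp.json()["items"][0]["articles"]
print(len(articles))   # capped at 1000 -- there is no skip/limit parameter to page further
print(articles[:3])    # each entry has "article", "views", and "rank"
```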
Thanks,
..........................................................................
Srdjan Grubor | +1.314.540.8328 | Endless http://endlessm.com/
(+Analytics-l)
Hello Srdjan,
The 1k limit is a hard one: only the top 1000 articles for a given day get loaded into the database. I've added the folks from the Analytics team to this thread; they may be able to help you, as they generate and expose the data in question.
Cheers,
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation
Hi Srdjan,
The data pipeline behind the API can't handle arbitrary skip or limit parameters, but there's a better way to answer the kind of question you have. We publish all the pageviews at https://dumps.wikimedia.org/other/pagecounts-ez/; look at the "Hourly page views per article" section. For your use case I would imagine one month of data is enough, and you can get the top N articles for all wikis this way, where N is anything you want. These files are compressed, so when you process and expand the data you'll see why we can't do this dynamically: the data is huge and our cluster is limited.
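Roughly, processing one of those per-article files could look like the sketch below; the file name and the whitespace-separated `project title monthly_total hourly_detail` layout are assumptions you should check against the documentation on that page:

```python
import bz2
import heapq

# Placeholder name for a local copy of one monthly per-article dump from
# https://dumps.wikimedia.org/other/pagecounts-ez/
DUMP = "pagecounts-2018-03-views-ge-5.bz2"
TOP_N = 100_000

def top_articles(path, project="en.z", n=TOP_N):
    """Return the n most-viewed titles for one project from a pagecounts-ez dump.

    Assumes whitespace-separated lines of the form:
        <project> <title> <monthly_total> <hourly_detail>
    """
    heap = []  # min-heap of (views, title), capped at n entries
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3 or parts[0] != project:
                continue
            title = parts[1]
            try:
                views = int(parts[2])
            except ValueError:
                continue
            if len(heap) < n:
                heapq.heappush(heap, (views, title))
            elif views > heap[0][0]:
                heapq.heapreplace(heap, (views, title))
    return sorted(heap, reverse=True)

if __name__ == "__main__":
    for views, title in top_articles(DUMP)[:20]:
        print(views, title)
```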
Hi Dan / Marko, Let me take a look and see if this will be good enough, but it looks promising! If you don't hear from me again, all is well :)
Thanks!
..........................................................................
Srdjan Grubor | +1.314.540.8328 | Endless http://endlessm.com/
On Mon, Apr 2, 2018 at 7:47 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
> [...] you can get the top N articles for all wikis this way, where N is anything you want.
One suggestion here: if you want to find articles that are consistently high in page views (and not part of spike/trend views), increase the time window to 6 months or longer.
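As a rough sketch of that (assuming the same per-article dump layout Dan pointed to; the file names below are placeholders), you could sum the monthly totals across several files before ranking:

```python
import bz2
from collections import Counter

# Placeholder names for six consecutive monthly dumps.
MONTHS = [f"pagecounts-2017-{m:02d}-views-ge-5.bz2" for m in (10, 11, 12)] + \
         [f"pagecounts-2018-{m:02d}-views-ge-5.bz2" for m in (1, 2, 3)]

totals = Counter()
for path in MONTHS:
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3 and parts[0] == "en.z":
                try:
                    totals[parts[1]] += int(parts[2])
                except ValueError:
                    pass

# Articles that rank high across the whole window, not just in one spiky month.
for title, views in totals.most_common(20):
    print(views, title)
```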
Best, Leila
--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation
> are trying to rebuild our stale encyclopedia apps for offline usage but are space-limited and would only like to include the most likely pages that would be looked at that can fit within a size envelope that varies with the device in question (up to 100k article limit probably)

For this use case I would be careful about treating page rank as true popularity, as the top data is regularly affected by bot spikes (a known issue that we intend to fix). After you have your list of most popular pages, please take a second look: some, but not all, of the pages that are artificially high due to bot traffic are fairly obvious (many are special pages).
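As an illustration only (the prefixes below are common examples, not an official or exhaustive list), a second pass over a ranked list could drop the obvious non-article titles before taking the top N:

```python
# Heuristic second pass over a ranked list of (views, title) pairs:
# drop titles that are clearly not content articles. This does not catch
# bot-inflated articles, only obvious non-article pages.
NON_ARTICLE_PREFIXES = (
    "Special:", "Wikipedia:", "File:", "Category:", "Template:",
    "Portal:", "Help:", "Talk:", "User:",
)

def looks_like_content(title: str) -> bool:
    return title != "Main_Page" and not title.startswith(NON_ARTICLE_PREFIXES)

def filter_ranked(ranked, n):
    """ranked: iterable of (views, title) sorted by views descending."""
    return [(v, t) for v, t in ranked if looks_like_content(t)][:n]
```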
Regarding what Nuria and Leila said: that makes sense for the top 1000, but if we take the top 100,000 pages, I figured spikes wouldn't really matter. Pages that spike into that range are likely to be in the top 100,000 normally anyway, and they are few enough in number that they wouldn't pollute the data (assuming you're filtering for actual content articles instead of all pages).
There are several methods for identifying spikes in page views, and if you're interested in identifying candidates amongst the very popular articles, they can work well. In our 2015 ICWSM paper (citation below), we used two additional months of data and ARIMA models with good results, but there are other approaches available as well (if I remember correctly, we cite some of the research in case you're looking for further reading).
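As a very rough illustration of the general idea (this is not the method from the paper; the ARIMA order and threshold below are arbitrary placeholders):

```python
# Flags days whose observed views far exceed what an ARIMA model fit on the
# preceding window predicts. Requires: pip install numpy statsmodels
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def spike_days(daily_views, train_len=60, threshold=3.0):
    """Return indices of days whose views look like spikes.

    daily_views: 1-D sequence of daily pageview counts for one article.
    train_len:   number of leading days used to fit the model (assumption).
    threshold:   residual standard deviations above the forecast that count as a spike.
    """
    y = np.asarray(daily_views, dtype=float)
    train, test = y[:train_len], y[train_len:]
    model = ARIMA(train, order=(1, 1, 1)).fit()       # order is a guess; tune per article
    forecast = model.forecast(steps=len(test))
    resid_sd = np.std(model.resid) or 1.0
    return [train_len + i for i, (obs, pred) in enumerate(zip(test, forecast))
            if obs - pred > threshold * resid_sd]
```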
The question of "what pages to include" is also closely related to the WP 1.0 Assessment project, which uses a combination of views, importance, and quality to rank pages. Might be useful to read up on their methodology: https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Article_s...
Warncke-Wang, M., Ranjan, V., Terveen, L. G., & Hecht, B. J. (2015, May). Misalignment Between Supply and Demand of Quality Content in Peer Production Communities. In *ICWSM* (pp. 493-502).
Cheers, Morten