Regarding what Nuria and Leila said: that makes sense for the top 1000, but for the top 100,000 pages I figured spikes wouldn't really matter. Pages that spike at that depth are likely to be in the top 100,000 normally anyway, and there are few enough of them that they wouldn't pollute the data (assuming you're filtering for actual content articles rather than all pages).
On Mon, Apr 2, 2018 at 12:09 PM, Nuria Ruiz nuria@wikimedia.org wrote:
> are trying to rebuild our stale encyclopedia apps for offline usage but
> are space-limited and would only like to include the most likely pages that would be looked at that can fit within a size envelope that varies with the device in question (up to 100k article limit probably)
For this use case I would be careful about treating page ranks as true popularity, as the top data is regularly affected by bot spikes (a known issue that we intend to fix). After you have your list of most popular pages, please take a second look: some, but not all, of the pages that are artificially high due to bot traffic are pretty obvious (many are special pages).
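As a rough illustration of that "second look", here is a minimal sketch that drops the obvious non-article entries from a ranked list of titles. The prefix list and the function name are assumptions made for illustration (English namespace names only), and it will not catch bot-inflated pages that look like ordinary articles.

```python
# Hypothetical helper: namespace prefixes that mark obviously non-article pages.
# The list is an assumption and covers English namespace names only.
NON_ARTICLE_PREFIXES = (
    "Special:", "Talk:", "User:", "Wikipedia:", "File:",
    "Template:", "Category:", "Help:", "Portal:",
)

def drop_obvious_non_articles(ranked_titles, prefixes=NON_ARTICLE_PREFIXES):
    """Return the titles in their original order, minus those with a listed prefix."""
    return [title for title in ranked_titles if not title.startswith(prefixes)]
```

Anything that survives this filter still deserves a manual check near the top of the list.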
On Mon, Apr 2, 2018 at 8:54 AM, Leila Zia leila@wikimedia.org wrote:
On Mon, Apr 2, 2018 at 7:47 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Hi Srdjan,
The data pipeline behind the API can't handle arbitrary skip or limit parameters, but there's a better way for the kind of question you have. We publish all the pageviews at https://dumps.wikimedia.org/other/pagecounts-ez/; look at the "Hourly page views per article" section. I would imagine one month of data is enough for your use case, and you can get the top N articles for all wikis this way, where N is anything you want.
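To make the "top N for any N" idea concrete, here is a minimal sketch of the aggregation step. It assumes each input line has already been reduced to whitespace-separated `wiki title count` fields; that layout is an assumption for illustration, not the actual dump format, so the parsing would need adapting to the real files. The function name is hypothetical.

```python
import heapq
from collections import Counter, defaultdict

def top_n_per_wiki(lines, n):
    """Sum view counts per (wiki, title) and return the n most viewed titles per wiki.

    Assumes each line looks like "wiki title count" (an illustrative layout,
    not the real dump format). Malformed lines are skipped.
    """
    totals = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        wiki, title, raw_count = parts[0], parts[1], parts[2]
        try:
            count = int(raw_count)
        except ValueError:
            continue
        totals[wiki][title] += count
    # Rank by view count, breaking ties by title so the result is deterministic.
    return {
        wiki: heapq.nlargest(n, counts.items(), key=lambda kv: (kv[1], kv[0]))
        for wiki, counts in totals.items()
    }
```

For example, `top_n_per_wiki(open("counts.txt"), 100000)` would return, for each wiki, a list of `(title, views)` pairs sorted from most to least viewed, where `counts.txt` is a hypothetical pre-processed file.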
One suggestion here: if you want to find articles with consistently high page views (rather than ones riding a spike or trend), increase the time window to 6 months or longer.
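One possible way to act on this, sketched under assumptions: rather than only summing views over a longer window, keep the titles that reach the top N in most or all of the months covered. The function name, the `min_months` parameter, and the per-month dictionary layout are all hypothetical.

```python
from collections import Counter

def consistently_popular(monthly_counts, n, min_months=None):
    """Return titles that rank in the top n in at least min_months of the given months.

    monthly_counts maps a month label to a {title: views} dict (assumed layout).
    By default a title must be in the top n in every month.
    """
    if min_months is None:
        min_months = len(monthly_counts)
    appearances = Counter()
    for counts in monthly_counts.values():
        # Rank by views descending, then by title so ties are deterministic.
        top = sorted(counts, key=lambda title: (-counts[title], title))[:n]
        appearances.update(top)
    return sorted(title for title, seen in appearances.items() if seen >= min_months)
```

A page that spikes in a single month appears in only one monthly top-N list and is dropped, while a steadily popular page survives.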
Best, Leila
-- Leila Zia Senior Research Scientist, Lead Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics