There are several methods for identifying spikes in page views, and if
you're interested in finding candidates among the very popular articles,
they can work well. In our 2015 ICWSM paper (citation below), we used two
additional months of data and ARIMA models with good results, but there
are other approaches as well (if I remember correctly, we cite some of the
related research in case you're looking for further reading).
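For a rough idea of what that looks like in practice, here is a minimal
Python sketch of ARIMA-based spike detection (not the exact setup from the
paper; the (1, 1, 1) order, the 60-day training window, and the
find_spikes helper are all illustrative):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def find_spikes(views, train_days=60, k=3.0):
        """views: pandas Series of daily page views, indexed by date."""
        train, test = views.iloc[:train_days], views.iloc[train_days:]
        fit = ARIMA(train, order=(1, 1, 1)).fit()  # order chosen for illustration
        predicted = np.asarray(fit.forecast(steps=len(test)))
        # Flag days where observed views exceed the forecast by more
        # than k residual standard deviations.
        threshold = k * fit.resid.std()
        return test.index[np.asarray(test) - predicted > threshold]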
The question of "what pages to include" is also closely related to the WP
1.0 Assessment project, which uses a combination of views, importance, and
quality to rank pages. Might be useful to read up on their methodology:
https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Article_…
Warncke-Wang, M., Ranjan, V., Terveen, L. G., & Hecht, B. J. (2015, May).
Misalignment Between Supply and Demand of Quality Content in Peer
Production Communities. In *ICWSM* (pp. 493-502).
Cheers,
Morten
On 2 April 2018 at 09:11, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
Regarding what Nuria and Leila said: that makes sense for the top 1000,
but if we take the top 100,000 pages, I figured spikes wouldn't really
matter. Pages that spike there are both likely to be in the top 100,000
normally and few enough in number that they wouldn't pollute the data
anyway (assuming you're filtering for actual content articles instead of
all pages).
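As a rough sketch of that content filter (the helper name and prefix list
below are illustrative and not a complete namespace list): titles in
non-article namespaces carry a "Namespace:" prefix, so main-namespace
articles can be kept like this:

    NON_CONTENT_PREFIXES = (
        "Special:", "Talk:", "User:", "Wikipedia:", "File:",
        "Template:", "Category:", "Help:", "Portal:", "Draft:",
        "Module:", "MediaWiki:",
    )

    def is_content_article(title):
        # Main-namespace (content) titles carry no namespace prefix.
        return not title.startswith(NON_CONTENT_PREFIXES)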
On Mon, Apr 2, 2018 at 12:09 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> are trying to rebuild our stale encyclopedia apps for offline usage but
> are space-limited and would only like to include the most likely pages
> that would be looked at that can fit within a size envelope that varies
> with the device in question (up to 100k article limit probably)
For this use case I would be careful about treating page ranks as true
popularity, as the top of the data is regularly affected by bot spikes
(a known issue that we intend to fix). After you have your list of most
popular pages, please take a second look: some, but not all, of the pages
that are artificially high due to bot traffic are pretty obvious (many
special pages).
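One hedged way to automate part of that second look: flag titles whose
monthly total is dominated by a single day, which is a common signature of
bot spikes. The helper and the 0.5 ratio threshold below are guesses for
illustration, not an official heuristic:

    def looks_bot_inflated(daily, ratio=0.5):
        """daily: one article's daily view counts for the month (pandas Series)."""
        total = daily.sum()
        return total > 0 and (daily.max() / total) > ratio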
On Mon, Apr 2, 2018 at 8:54 AM, Leila Zia <leila(a)wikimedia.org> wrote:
On Mon, Apr 2, 2018 at 7:47 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
Hi Srdjan,
The data pipeline behind the API can't handle arbitrary skip or limit
parameters, but there's a better way for the kind of question you have. We
publish all the pageviews at
https://dumps.wikimedia.org/other/pagecounts-ez/; look at the "Hourly page
views per article"
section. I would imagine for your use case one month of data is enough,
and you can get the top N articles for all wikis this way, where N is
anything you want.
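A rough sketch of pulling the top N titles for one project from a month of
those files (the line format assumed below, "project title monthly_total
...", is from memory; check the README that ships with the dumps before
relying on it):

    import bz2
    import heapq

    def top_n_articles(path, project="en.z", n=100000):
        counts = {}
        with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                # Keep lines for the requested project with a parsable count.
                if len(parts) < 3 or parts[0] != project:
                    continue
                try:
                    counts[parts[1]] = counts.get(parts[1], 0) + int(parts[2])
                except ValueError:
                    continue
        return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])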
One suggestion here: if you want to find articles that are consistently
high in page views (and not riding a spike or trend), increase the time
window to six months or longer.
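A sketch of that idea, assuming you have run a monthly top-N extraction
(like the one above) once per month: keep only articles that appear in
every month's list, then order the survivors by total views.

    def consistently_popular(monthly_tops):
        """monthly_tops: one [(title, views), ...] list per month."""
        keep = set.intersection(*(set(t for t, _ in m) for m in monthly_tops))
        totals = {}
        for month in monthly_tops:
            for title, views in month:
                if title in keep:
                    totals[title] = totals.get(title, 0) + views
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)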
Best,
Leila
--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics