Hi Valerio,

Mako was referring to https://dumps.wikimedia.org/other/pagecounts-raw/ and the current logging practices. My understanding is also that these things are not logged on a routine basis. The Wikibench traces seem to have been a special case.

I've also contacted the researchers who partially released it, but making it publicly available is tricky for them because of its size (12 TB), even though that volume may well be within the norm of what Wikipedia's servers handle daily.

Have the researchers looked into requester-pays data storage on Amazon or another provider? They should be able to make it public with no resources and at no cost to themselves whatever the size.


On Wed, Sep 24, 2014 at 7:09 PM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
Hello Mako,

On Wed, Sep 24, 2014 at 8:13 AM, Benj. Mako Hill <mako@atdot.cc> wrote:
> Users mostly read the most recent version of a given page, but from time to
> time, read accesses to the 'history' of a page happen.

At least as far as I know, views to historical versions of webpages in
Wikipedia don't show up in the access logs at all because certain
kinds of requests (like requests to /w/index.php?oldid=NUMBER) don't
get recorded in the pageview data.

I'm sorry to contradict you, but at least in the Wikibench traces, that information is clearly present. I see things like:

That is, back in 2007, users were accessing a version of that page that dated back to 2005 or so.
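
For anyone who wants to check a trace for these request types, here is a minimal sketch of how the URLs could be told apart. It assumes the trace gives full request URLs and relies only on the standard MediaWiki query parameters oldid and action=history; the function name classify_request and the sample URLs are made up for illustration and are not taken from the Wikibench data.

```python
from urllib.parse import urlsplit, parse_qs


def classify_request(url: str) -> str:
    """Label a MediaWiki-style URL as 'history', 'old_revision', or 'other'."""
    # parse_qs maps each query parameter to a list of its values
    query = parse_qs(urlsplit(url).query)
    # action=history is the revision-history listing of a page
    if query.get("action", [""])[0] == "history":
        return "history"
    # oldid=NUMBER requests one specific past revision
    if "oldid" in query:
        return "old_revision"
    # everything else, including plain /wiki/Title page views
    return "other"


# Hypothetical sample URLs, for illustration only
samples = [
    "http://en.wikipedia.org/w/index.php?title=Example&oldid=12345",
    "http://en.wikipedia.org/w/index.php?title=Example&action=history",
    "http://en.wikipedia.org/wiki/Example",
]
for url in samples:
    print(classify_request(url), url)
```

Counting the labels over a whole trace would show how often old revisions and history pages are actually requested.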

> New versions of a page are created as well. Finally, users might
> potentially need to explore several old versions of a given web
> page, for example by accessing the details of its history[1].

AFAIK, viewing the history page itself is not recorded in the
page view data either.

Sorry to contradict you again, but there are indeed logs for that as well:


I'm quite surprised that such information is not known to the community of Wikipedia researchers.


Wiki-research-l mailing list

Scott Hale
Oxford Internet Institute
University of Oxford