Hi Michal,
it seems that the data set you want would be very similar to the
one I recently requested; see this Phabricator item:
https://phabricator.wikimedia.org/T128132
There is a data set for the year 2007, part of which is publicly
available [1]. See [2] for a study using the 2007 data set.
My focus has been on simulating the performance of WMF's caching
servers, for which the 2007 data set is insufficient. However, a
different research domain might require a slightly different focus
when capturing the data set.
The 2007 data set was captured with a sampling rate of 1:10. For
my project, such a high sampling rate would be perfect (1:100
might also work). However, I learned that the current request rate
is much higher than in 2007, so we'd have to narrow the scope of
the data set (e.g., by focussing on specific WMF projects, like the
English Wikipedia). You can find a discussion on the Phabricator
page linked above.
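To make the sampling ratios concrete, here is a minimal Python
sketch of two common ways to take a 1:n sample of a request log.
The function names and the list-of-requests input are purely
illustrative assumptions on my side; I don't know which method was
used for the 2007 capture or what WMF would use today.

```python
import random


def sample_every_nth(requests, n):
    """Systematic 1:n sample: keep every n-th request, starting with the first."""
    return requests[::n]


def sample_random(requests, n, seed=None):
    """Random 1:n sample: keep each request independently with probability 1/n.

    The sample size is only n-th of the input on average, not exactly.
    """
    rng = random.Random(seed)  # seeded generator so a sample can be reproduced
    return [r for r in requests if rng.random() < 1.0 / n]


if __name__ == "__main__":
    log = list(range(1000))  # stand-in for 1000 logged requests
    print(len(sample_every_nth(log, 10)))   # 1:10 sample
    print(len(sample_every_nth(log, 100)))  # 1:100 sample
```

For cache simulation the difference matters: at 1:100, most repeat
requests for the same object are dropped, which is why I'd prefer
1:10 if the volume allows it.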
What would be the lowest sampling rate allowable for your project?
I assume the publicly available hourly access data [3], [4] would
be insufficient?
Feel free to comment on the Phabricator item; maybe we can compile
a single data set that works for both of our research domains and
also helps other people?
Best,
Daniel
[1]
[2] http://www.distributed-systems.net/papers/2009.comnet-wiki.pdf
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
[4] http://dumps.wikimedia.org/other/pagecounts-all-sites/
On 03/21/2016 10:11 AM, Michal Bystricky wrote: