Hi everyone,
I'm a PhD student studying mathematical models to improve the hit
ratio of web caches. In my research community, we are lacking
realistic data sets and frequently rely on outdated modelling
assumptions.
Previously (around 2007), a trace containing 10% of user requests
issued to Wikipedia was publicly released [1]. This data set has been
used widely for performance evaluations of new caching algorithms,
e.g., for the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g.,
in the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive
data (e.g., client IPs), it would need anonymization before making
it public. I would be glad to help with that.
The previously released data set [1] contains no client information.
It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an
update flag. I would additionally suggest including 5) the cache's
hostname, 6) the cache_status, and 7) the response size (from the
Wikimedia cache log format).
I believe this format would preserve anonymity, and would be
interesting for many researchers.
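To make the proposal concrete, here is a rough sketch of how the
anonymization step could look. It assumes each log entry has already
been parsed into a dictionary. The key names ("timestamp", "url",
"update", "hostname", "cache_status", "response_size", "ip") and the
sample values are placeholders of mine, not the actual schema of the
request stream, and the tab-separated output is just one possible
layout.

```python
from typing import Dict, Iterable, Iterator

# Fields 2) to 7) of the proposed format. These key names are
# hypothetical; the real log schema may use different ones.
KEPT_FIELDS = (
    "timestamp",
    "url",
    "update",
    "hostname",
    "cache_status",
    "response_size",
)


def anonymize(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield one tab-separated trace line per parsed log record."""
    for counter, rec in enumerate(records, start=1):
        # Whitelist approach: only the fields named above are copied,
        # so client IPs and any other keys never reach the output.
        fields = [str(rec[name]) for name in KEPT_FIELDS]
        # Field 1) is a running counter, as in the earlier trace [1].
        yield "\t".join([str(counter)] + fields)


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = [
        {
            "timestamp": "1450000000.123",
            "url": "http://en.wikipedia.org/wiki/Cache",
            "update": "-",
            "hostname": "cache-host-1",
            "cache_status": "hit",
            "response_size": "51234",
            "ip": "192.0.2.1",
        }
    ]
    for line in anonymize(sample):
        print(line)
```

A whitelist is used instead of a blacklist so that any field not
explicitly listed is dropped by default.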
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream