(cc'ing Tim Starling, who is credited on your dataset page and might know more about this)
>I would like to ask for your comments about compiling a similar (updated) data set and making it public.


As far as I can see, the prior dataset contained the following:

Counter, timestamp, URL, save flag

929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png -
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png  save 
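
For anyone working with the old trace, here is a minimal Python sketch of how lines in that format could be parsed. It assumes four whitespace-separated fields per line and that the flag is either "-" or "save", as in the two sample lines above; the names TraceRecord and parse_trace_line are made up for illustration.

```python
from typing import NamedTuple


class TraceRecord(NamedTuple):
    """One request from the trace (hypothetical type, for illustration only)."""
    counter: int
    timestamp: float  # seconds since the Unix epoch, with millisecond fraction
    url: str
    is_save: bool


def parse_trace_line(line: str) -> TraceRecord:
    """Parse a 'counter timestamp url flag' line.

    Assumes exactly four whitespace-separated fields and that the flag
    is 'save' for a save request and '-' otherwise.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    counter, timestamp, url, flag = fields
    return TraceRecord(int(counter), float(timestamp), url, flag == "save")
```

Because split() without arguments collapses runs of whitespace, the doubled space and trailing space in the second sample line do not matter.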

I can see how we could produce a dataset with timestamp and URL, and adding a counter is something that can be done (although on our current system the ordering of requests in the logs is not guaranteed). However, I really do not know whether it is possible to add a flag indicating whether the request was a save. As far as I know, our current system does not have that information, and it seems it would require tapping into the cache lookups. That means you would need to get that info from Varnish lookups as requests are happening, which is before the analytics systems get any of the data.
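
To illustrate the counter part only: a rough sketch, assuming the records are available as (timestamp, url) pairs and that sorting by timestamp is an acceptable stand-in for the true request order (which, as said above, the logs do not guarantee). The function name add_counter is hypothetical.

```python
def add_counter(records):
    """Attach a 1-based counter to (timestamp, url) pairs.

    Assumption: ordering by timestamp approximates request order.
    sorted() is stable, so records with equal timestamps keep the
    relative order they had in the input.
    """
    ordered = sorted(records, key=lambda record: record[0])
    return [
        (counter, timestamp, url)
        for counter, (timestamp, url) in enumerate(ordered, start=1)
    ]
```

Such a counter would only reflect timestamp order after the fact, not the order in which the caches actually served the requests.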

Anyway, I hope other folks can chime in on how, or whether, this can be done somewhat easily. It certainly requires access to parts of the stack beyond the analytics infrastructure.


Thanks, 

Nuria


On Wed, Feb 24, 2016 at 3:05 AM, Daniel Berger <berger@cs.uni-kl.de> wrote:
Hi everyone,

I'm a PhD student studying mathematical models to improve the hit ratio of web caches. In my research community, we lack realistic data sets and frequently rely on outdated modelling assumptions.

Previously (~2007), a trace containing 10% of the user requests issued to Wikipedia was publicly released [1]. This data set has been widely used for performance evaluations of new caching algorithms, e.g., for the new Caffeine caching framework for Java [2].

I would like to ask for your comments about compiling a similar (updated) data set and making it public.


As I understand it, the necessary logs are readily available, e.g., in the Analytics/Data/Mobile requests stream [3] on stat1002, with a sampling rate of 1:100. As this request stream contains sensitive data (e.g., client IPs), it would need anonymization before being made public. I would be glad to help with that.

The previously released data set [1] contains no client information. It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update flag. I would additionally suggest including 5) the cache's hostname, 6) the cache_status, and 7) the response size (from the Wikimedia cache log format).
I believe this format would preserve anonymity, and would be interesting for many researchers.
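
As a rough illustration of the proposed format, here is a minimal Python sketch that keeps only the seven fields listed above and drops everything else (e.g., client IPs). It assumes each log entry is already available as a dict; the key names and the function name strip_client_fields are hypothetical and would need to be mapped to the actual log field names.

```python
# Hypothetical key names for the seven proposed fields; the real
# log format may name them differently.
KEPT_FIELDS = (
    "counter",
    "timestamp",
    "url",
    "update_flag",
    "hostname",
    "cache_status",
    "response_size",
)


def strip_client_fields(record: dict) -> dict:
    """Return a copy of the record restricted to the allow-listed fields.

    Any field not in KEPT_FIELDS (client IP, user agent, ...) is dropped.
    Fields that are missing from the input are simply absent in the output.
    """
    return {name: record[name] for name in KEPT_FIELDS if name in record}
```

Using an allow-list rather than a list of fields to remove means that any field added to the logs later is excluded by default.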

Let me know your thoughts.

Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger

[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
