(cc-ing Tim Starling, who is credited on your dataset page and might know
more about this)
I would like to ask for your comments about compiling a similar (updated)
data set and making it public.
As far as I can see, the prior dataset contained the following fields:
Counter, timestamp, url, save flag
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png -
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png save
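
For anyone who wants to work with records in this shape, here is a minimal
Python sketch that parses one whitespace-separated line into the four fields
listed above. The names TraceRecord and parse_trace_line are hypothetical, and
treating "-" as "not a save" is an assumption based only on the two sample
lines.

```python
from typing import NamedTuple


class TraceRecord(NamedTuple):
    """One request from the trace (hypothetical container, not an official schema)."""
    counter: int
    timestamp: float  # seconds since the Unix epoch, with millisecond fraction
    url: str
    is_save: bool


def parse_trace_line(line: str) -> TraceRecord:
    """Parse 'counter timestamp url flag' into a TraceRecord."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    counter, timestamp, url, flag = fields
    # Assumption: the flag is the literal string "save" for saves and "-" otherwise.
    return TraceRecord(int(counter), float(timestamp), url, flag == "save")
```

Calling parse_trace_line on the second sample line gives a record whose
is_save field is True; the first sample line gives False.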
I can see how we could produce a dataset with timestamp and URL, and adding a
counter is something that can be done (although on our current system the
ordering of requests is not guaranteed in the logs). However, I really do not
know whether it is possible to add a flag for whether the request was a save
or not. As far as I know, that is not information we have on our current
system, and it seems that it would require tapping into the cache lookups.
That means you would need to get that info from Varnish lookups as requests
are happening, which is before the analytics systems get any of the data.
Anyway, I hope other folks can chime in on how/whether this can be done
somewhat easily; it certainly requires access to other parts of the stack
besides the analytics infrastructure.
Thanks,
Nuria
On Wed, Feb 24, 2016 at 3:05 AM, Daniel Berger <berger(a)cs.uni-kl.de> wrote:
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio of
web caches. In my research community, we are lacking realistic data sets
and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used widely for
performance evaluations of new caching algorithms, e.g., for the new
Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar (updated)
data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a sampling
rate of 1:100. As this request stream contains sensitive data (e.g., client
IPs), it would need anonymization before being made public. I would be glad
to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update flag. I
would additionally suggest including 5) the cache's hostname, 6) the
cache_status, and 7) the response size (from the Wikimedia cache log
format).
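
As a rough illustration of the seven-field record, here is a minimal Python
sketch that keeps only the fields listed above and drops everything else,
including any client information. The input key names (timestamp, url,
update_flag, hostname, cache_status, response_size, client_ip) and the
function name to_public_records are hypothetical; they are not taken from the
actual Wikimedia log schema. The counter is simply assigned in the order the
records arrive.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# Fields 2) to 7) of the suggested format; the counter (field 1) is generated.
# Assumption: these key names are placeholders for whatever the real logs use.
PUBLIC_FIELDS = (
    "timestamp",
    "url",
    "update_flag",
    "hostname",
    "cache_status",
    "response_size",
)


def to_public_records(raw_records: Iterable[Dict[str, Any]]) -> Iterator[Tuple[Any, ...]]:
    """Yield (counter, timestamp, url, update_flag, hostname, cache_status, size).

    Only whitelisted fields are copied, so client IPs and any other keys in
    the raw record never reach the output.
    """
    for counter, raw in enumerate(raw_records, start=1):
        yield (counter,) + tuple(raw[name] for name in PUBLIC_FIELDS)
```

Copying from a whitelist, rather than deleting known-sensitive keys, means a
field added to the logs later is left out unless someone explicitly opts it in.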
I believe this format would preserve anonymity, and would be interesting
for many researchers.
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics