Nuria, thank you for pointing out that exporting a save flag for each
request will be complicated. I wasn't aware of that.
It would be very interesting to learn how the previous data set's save
flag was exported back in 2007.
Maybe it would be possible to derive a save flag with data already
available to the analytics infrastructure (stat1002's requests streams).
Here are two naive ideas.
1) In Wikimedia's cache log format [1], I can see that the request
method (%m) is logged. Wouldn't the request method allow us to detect
POST requests and thus set the save flag?
Maybe the log even includes the PURGE requests triggered by a save
operation?
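
Idea 1 could be sketched roughly as follows. This is only an illustration: I am assuming that each log record has already been parsed into a dict, that the %m value ends up in a hypothetical field called `http_method`, and that a POST (or a PURGE, if those are logged at all) really does indicate a save. The second URL is a made-up example.

```python
# Sketch only: "http_method" is a hypothetical field name standing in for
# the %m value of the cache log format. Treating POST/PURGE as "save" is an
# assumption that would need checking against real traffic.
SAVE_METHODS = {"POST", "PURGE"}


def save_flag(record):
    """Return "save" if the request method suggests an update, else "-"."""
    method = record.get("http_method", "").upper()
    return "save" if method in SAVE_METHODS else "-"


# Made-up sample records for illustration.
requests = [
    {"url": "http://en.wikipedia.org/images/wiki-en.png",
     "http_method": "GET"},
    {"url": "http://en.wikipedia.org/w/index.php?title=Example&action=submit",
     "http_method": "POST"},
]
flags = [save_flag(r) for r in requests]
```

Not every POST is an edit (logins and API calls also use POST), so a flag derived this way would probably over-count saves.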
2) We can try detecting object updates by changes in their size.
Specifically, we would need to know the response size and whether the
response was gzipped. Without knowing whether a response was gzipped we
might be detecting many spurious object updates.
Unfortunately, it seems that the cache log format [1] does not include
the Content-Encoding, which we would need in order to detect gzipped
responses.
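
To make idea 2 more concrete, here is a minimal sketch. It assumes the records carry three hypothetical fields (`url`, `response_size`, `gzipped`), the last of which may not be available at all, as noted above. Sizes are tracked separately per encoding so that a gzipped and an uncompressed copy of the same object are not mistaken for an update.

```python
# Sketch only: the field names "url", "response_size" and "gzipped" are
# hypothetical; the cache log format may not expose the encoding at all.
def detect_updates(records):
    """Yield (record, updated) pairs.

    updated is True when the response size differs from the size last
    seen for the same URL and the same encoding.
    """
    last_size = {}
    for rec in records:
        key = (rec["url"], rec["gzipped"])
        size = rec["response_size"]
        previous = last_size.get(key)
        updated = previous is not None and previous != size
        last_size[key] = size
        yield rec, updated


# Made-up sample records for illustration.
url = "http://en.wikipedia.org/images/wiki-en.png"
records = [
    {"url": url, "response_size": 1000, "gzipped": False},
    {"url": url, "response_size": 400, "gzipped": True},    # gzipped copy
    {"url": url, "response_size": 1000, "gzipped": False},  # unchanged
    {"url": url, "response_size": 1200, "gzipped": False},  # size changed
]
update_flags = [updated for _, updated in detect_updates(records)]
```

An edit that leaves the size unchanged would be missed, and sizes can differ for reasons other than a save (error responses, for example), so this is a heuristic at best.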
Best,
Daniel
[1]
(cc-ing Tim Starling, who is credited on your dataset page and might
know more about this)
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
As far as I can see the prior dataset contained the following:
Counter, timestamp, url, save flag
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png -
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png save
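
For reference, a record in this format could be read along these lines. This is a sketch only, assuming the four fields are separated by whitespace, the URL contains no whitespace, and the flag is either `-` or `save`; the names `TraceEntry` and `parse_trace_line` are my own.

```python
from typing import NamedTuple


class TraceEntry(NamedTuple):
    counter: int
    timestamp: float
    url: str
    is_save: bool


def parse_trace_line(line):
    """Parse one "counter timestamp url flag" record.

    Assumes whitespace-separated fields and a URL without whitespace;
    raises ValueError if the record does not have exactly four fields.
    """
    counter, timestamp, url, flag = line.split()
    return TraceEntry(int(counter), float(timestamp), url, flag == "save")


entry = parse_trace_line(
    "929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png -"
)
```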
I can see how we could get a dataset with timestamp and url, and adding a
counter is something that can be done (though on our actual system the
ordering of requests is not guaranteed in the logs). Now, I really do not
know whether it is possible to add a flag indicating whether the request
was a save or not. As far as I know, that is not information we have on
our current system, and it seems that it will require tapping into the
cache lookups to get that info. That means you would need to get that info
from Varnish lookups as requests are happening, which is before the
analytics systems get any of the data.
Anyway, I hope other folks can chime in on how/whether this can be done
somewhat easily. It certainly requires access to other parts of the
stack besides the analytics infrastructure.
Thanks,
Nuria
On Wed, Feb 24, 2016 at 3:05 AM, Daniel Berger <berger(a)cs.uni-kl.de> wrote:
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit
ratio of web caches. In my research community, we are lacking
realistic data sets and frequently rely on outdated modelling
assumptions.
Previously (~2007), a trace containing 10% of user requests issued
to Wikipedia was publicly released [1]. This data set has been
used widely for performance evaluations of new caching algorithms,
e.g., for the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g.,
in the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive
data (e.g., client IPs), it would need anonymization before making
it public. I would be glad to help with that.
The previously released data set [1] contains no client information.
It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an
update flag. I would additionally suggest including 5) the cache's
hostname, 6) the cache_status, and 7) the response size (from the
Wikimedia cache log format).
I believe this format would preserve anonymity, and would be
interesting for many researchers.
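
As a rough illustration of the proposed seven-field format, the export step might look like the sketch below. All input field names and sample values are hypothetical (the real request stream may name things differently), and the sample client IP comes from a documentation-only address range.

```python
# Sketch only: every field name here is a hypothetical stand-in for
# whatever the request stream actually provides.
PUBLIC_FIELDS = (
    "timestamp",
    "url",
    "update_flag",
    "hostname",
    "cache_status",
    "response_size",
)


def to_public_record(counter, raw):
    """Build the 7-field public record: a counter plus whitelisted fields.

    Anything not listed in PUBLIC_FIELDS (e.g. the client IP) is dropped.
    """
    return (counter,) + tuple(raw[name] for name in PUBLIC_FIELDS)


# Made-up sample record for illustration.
raw = {
    "timestamp": 1190146243.303,
    "url": "http://en.wikipedia.org/images/wiki-en.png",
    "update_flag": "-",
    "hostname": "cache-host-1",
    "cache_status": "hit",
    "response_size": 1234,
    "client_ip": "192.0.2.1",
}
public = to_public_record(1, raw)
```

Whitelisting the exported fields, rather than deleting known sensitive ones, means that any field added to the stream later stays private by default.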
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1]
http://www.wikibench.eu/?page_id=60
[2]
https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/analytics