Nuria, thank you for pointing out that exporting a save flag for each request will be complicated. I wasn't aware of that.
It would be very interesting to learn how the previous data set's save flag was exported back in 2007.
Maybe it would be possible to derive a save flag from data already available to the analytics infrastructure (stat1002's request streams). Here are two naive ideas.
1) In Wikimedia's cache log format [1], I can see that the request method (%m) is logged. Wouldn't the request method allow us to detect POST requests and thus set the save flag? Maybe the log even includes the PURGE requests triggered by a save operation?
2) We can try detecting object updates by changes in their size. Specifically, we would need to know the response size and whether the response was gzipped. Without knowing whether a response was gzipped, we might detect many spurious object updates. Unfortunately, it seems that the cache log format [1] does not include the Content-Encoding, so we might not be able to detect gzipped responses?
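To make the two ideas a bit more concrete, here is a rough Python sketch. The `Request` record and its field names are hypothetical and do not reflect the actual log schema; in particular, `content_encoding` is assumed to be available, which the cache log format [1] does not seem to provide. Treating every POST as a save is also only an approximation.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional, Tuple


class Request(NamedTuple):
    # Hypothetical record; field names are illustrative, not the real log schema.
    timestamp: float
    url: str
    method: str
    response_size: int
    content_encoding: Optional[str] = None  # assumed field, apparently not logged today


def flag_by_method(req: Request) -> bool:
    """Idea 1: use the request method as a proxy for a save (not every POST is a save)."""
    return req.method.upper() == "POST"


def flag_by_size_change(requests: Iterable[Request]) -> Iterator[Tuple[Request, bool]]:
    """Idea 2: flag a request whose size differs from the last one seen for the same object.

    Sizes are only compared within the same (url, content_encoding) pair, so that a
    gzipped and a plain response for the same object are not mistaken for an update.
    """
    last_size: Dict[Tuple[str, Optional[str]], int] = {}
    for req in requests:
        key = (req.url, req.content_encoding)
        previous = last_size.get(key)
        changed = previous is not None and previous != req.response_size
        last_size[key] = req.response_size
        yield req, changed
```

Without the encoding in the key, any alternation between gzipped and plain responses for the same URL would show up as an update, which is the spurious-detection problem mentioned in 2).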
Best, Daniel
[1] https://wikitech.wikimedia.org/wiki/Cache_log_format
On 02/24/2016 09:59 PM, Nuria Ruiz wrote:
(cc-ing Tim Starling, who is credited on your dataset page and might know more about this)
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
As far as I can see the prior dataset contained the following:
Counter, timestamp, url, save flag
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png -
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png save
I can see how we could produce a dataset with timestamp and URL, and adding a counter is something that can be done (although on our current system the ordering of requests in the logs is not guaranteed). However, I really do not know whether it is possible to add a flag indicating whether the request was a save. As far as I know, that information is not available on our current system, and it seems that it would require tapping into the cache lookups. That is, you would need to get it from the Varnish lookups as requests are happening, which is before the analytics systems receive any of the data.
Anyway, I hope other folks can chime in on how (or whether) this can be done somewhat easily. It certainly requires access to parts of the stack beyond the analytics infrastructure.
Thanks,
Nuria
On Wed, Feb 24, 2016 at 3:05 AM, Daniel Berger <berger@cs.uni-kl.de> wrote:
Hi everyone,

I'm a PhD student studying mathematical models to improve the hit ratio of web caches. In my research community, we are lacking realistic data sets and frequently rely on outdated modelling assumptions.

Previously (~2007), a trace containing 10% of user requests issued to Wikipedia was publicly released [1]. This data set has been used widely for performance evaluations of new caching algorithms, e.g., for the new Caffeine caching framework for Java [2].

I would like to ask for your comments about compiling a similar (updated) data set and making it public.

In my understanding, the necessary logs are readily available, e.g., in the Analytics/Data/Mobile requests stream [3] on stat1002, with a sampling rate of 1:100. As this request stream contains sensitive data (e.g., client IPs), it would need anonymization before making it public. I would be glad to help with that.

The previously released data set [1] contains no client information. It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update flag. I would additionally suggest including 5) the cache's hostname, 6) the cache_status, and 7) the response size (from the Wikimedia cache log format). I believe this format would preserve anonymity and would be interesting for many researchers.

Let me know your thoughts.

Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger

[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics