Tim, thanks a lot. Your scripts show that we can get everything from the cache log format.
What is the current sampling rate for the cache logs in /a/log/webrequest/archive? I understand that the rates listed on wikitech might be outdated:
- 1:1000 for the general request stream [1], and
- 1:100 for the mobile request stream [2].
The 2007 trace had a 1:10 sampling rate, which means much more data. Would 1:10 still be feasible today?
A high sampling rate would be important to reproduce the cache hit ratio as seen by the Varnish caches. However, how high it needs to be depends on how the caches are load balanced. If requests get distributed round robin (and there are many caches), then a 1:100 sampling rate would probably be enough to reproduce their hit rate. If requests get distributed by hashing over URLs (or similar), then we might need a higher sampling rate (like 1:10) to capture the request stream's temporal locality.
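To make the load-balancing argument concrete, here is a toy sketch in Python. It is not a model of the real Varnish setup: the synthetic stream (each page requested in a burst of 10), the cache count, the LRU policy, the cache capacity, and all function names are assumptions chosen purely for illustration.

```python
import zlib
from collections import OrderedDict


def lru_hits(requests, capacity):
    """Count the hits of an LRU cache holding `capacity` URLs."""
    cache = OrderedDict()
    hits = 0
    for url in requests:
        if url in cache:
            hits += 1
            cache.move_to_end(url)         # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits


def cluster_hit_ratio(stream, n_caches, capacity, route):
    """Split the stream over n_caches using route(), then sum the hits."""
    per_cache = [[] for _ in range(n_caches)]
    for i, url in enumerate(stream):
        per_cache[route(i, url) % n_caches].append(url)
    hits = sum(lru_hits(reqs, capacity) for reqs in per_cache)
    return hits / len(stream)


def round_robin(i, url):
    return i                               # request index picks the cache


def url_hash(i, url):
    return zlib.crc32(url.encode())        # URL alone picks the cache


N = 10  # number of caches, burst length, and sampling divisor (assumed)
stream = [f"/wiki/Page_{p}" for p in range(1000) for _ in range(N)]
sampled = stream[::N]                      # a 1:N sampled trace

rr_ratio = cluster_hit_ratio(stream, N, 100, round_robin)
hash_ratio = cluster_hit_ratio(stream, N, 100, url_hash)
sampled_ratio = lru_hits(sampled, 100) / len(sampled)
print(rr_ratio, hash_ratio, sampled_ratio)  # prints: 0.0 0.9 0.0
```

In this deliberately extreme workload, the 1:10 sampled trace matches the hit ratio of the round-robin caches, but it misses all the hits that URL-hashed caches would see, because sampling removes the repeated requests within each burst.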
Starting from the fields of the 2007 trace, it would be important to include
- the request size $7
and it would be helpful to include
- the cache hostname $1
- the cache request status $6
Building on your awk script, this would be something along the lines of:
function savemark(url, code) {
    if (url ~ /action=submit$/ && code == "TCP_MISS/302")
        return "save"
    return "-"
}

$5 !~ /^(145.97.39.|66.230.200.|211.115.107.)/ {
    print $1, $3, $4, $9, $7, savemark($9, $6), $6
}
Would this be an acceptable format?
Let me know your thoughts.
Thanks a lot,
Daniel
[1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
On 02/25/2016 12:04 PM, Tim Starling wrote:
On 25/02/16 21:14, Daniel Berger wrote:
Nuria, thank you for pointing out that exporting a save flag for each request will be complicated. I wasn't aware of that.
It would be very interesting to learn how the previous data set's save flag was exported back in 2007.
As I suspected in my offlist post, the save flag was set using the HTTP response code. Here are the files as they were when they were first committed to version control in 2012. I think they were the same in 2007 except for the IP address filter:
vu.awk:
function savemark(url, code) {
    if (url ~ /action=submit$/ && code == "TCP_MISS/302")
        return "save"
    return "-"
}

$5 !~ /^(145.97.39.|66.230.200.|211.115.107.)/ {
    print $3, $9, savemark($9, $6)
}
urjc.awk:
function savemark(url, code) {
    if (url ~ /action=submit$/ && code == "TCP_MISS/302")
        return "save"
    return "-"
}

$5 !~ /^(145.97.39.|66.230.200.|211.115.107.)/ {
    print $3, $9, savemark($9, $6), $4, $8
}
-- Tim Starling