Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio of web caches. In my research community, we lack realistic data sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to Wikipedia was publicly released [1]. This data set has been widely used for performance evaluations of new caching algorithms, e.g., for the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar (updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in the Analytics/Data/Mobile requests stream [3] on stat1002, with a sampling rate of 1:100. As this request stream contains sensitive data (e.g., client IPs), it would need anonymization before being made public. I would be glad to help with that.
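For concreteness, here is a minimal sketch of what such an anonymization step could look like. It assumes the client IP is the fifth whitespace-separated field of a cache log line; the field position is an assumption on my part, not the confirmed production format:

# anonymize.awk -- minimal sketch: blank out the client IP before release,
# assuming it sits in field $5 of the sampled cache log (an assumption).
{
    $5 = "-"     # replace the client IP so no client information is published
    print
}

This could be run as, e.g., awk -f anonymize.awk sampled-requests.log > public-trace.log (file names are placeholders).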
The previously released data set [1] contains no client information. It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update flag. I would additionally suggest including 5) the cache's hostname, 6) the cache_status, and 7) the response size (from the Wikimedia cache log format). I believe this format would preserve anonymity and would be interesting for many researchers.
Let me know your thoughts.
Thanks, Daniel Berger http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60 [2] https://github.com/ben-manes/caffeine/wiki/Efficiency [3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
(cc-ing Tim Starling, who is credited on your dataset page and might know more about this)
> I would like to ask for your comments about compiling a similar (updated) data set and making it public.
As far as I can see the prior dataset contained the following:
Counter, timestamp, url, save flag
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png -
929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png save
I can see how we could get a dataset with timestamp and URL, and adding a counter is something that can be done (though on our current system the ordering of requests is not guaranteed in the logs). Now, I really do not know whether it is possible to add a flag indicating whether the request was a save or not. As far as I know that is not information we have in our current system, and it seems it would require tapping into the cache lookups to get that info. Meaning that you would need to get that info from varnish lookups as requests are happening, which is before the analytics systems get any of the data.
Anyway, I hope other folks can chime in on how/whether this can be done somewhat easily; it certainly requires access to other parts of the stack besides the analytics infrastructure.
Thanks,
Nuria
Nuria, thank you for pointing out that exporting a save flag for each request will be complicated. I wasn't aware of that.
It would be very interesting to learn how the previous data set's save flag was exported back in 2007.
Maybe it would be possible to derive a save flag from data already available to the analytics infrastructure (stat1002's request streams). Here are two naive ideas.
1) In Wikimedia's cache log format [1], I can see that the request method (%m) is logged. Wouldn't the request method allow us to detect POST requests and thus set the save flag? Maybe the log even includes the PURGE requests triggered by a save operation?
2) We could try detecting object updates by changes in their size. Specifically, we would need to know the response size and whether the response was gzipped. Without knowing whether a response was gzipped, we might detect many spurious object updates. Unfortunately, it seems that the cache log format [1] does not include the Content-Encoding, so would we even be able to distinguish gzipped responses?
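To make these two ideas concrete, here is a rough awk sketch. Since I don't want to commit to exact field positions in the production log, it assumes a pre-extracted input with three columns (request method, URL, response size); the column layout and the "update?" marker are assumptions for illustration only:

# update-flags.awk -- sketch of ideas 1) and 2) above.
# Input (assumed): three whitespace-separated columns per line:
#   request method, URL, response size
{
    method = $1; url = $2; size = $3

    flag = "-"
    # Idea 1: non-GET methods (POST, PURGE) hint at a save/update.
    if (method == "POST" || method == "PURGE")
        flag = "save"

    # Idea 2: a change in response size for the same URL hints at an update.
    # Without Content-Encoding this can produce spurious updates whenever the
    # same object is served both gzipped and uncompressed.
    if (url in lastsize && size != lastsize[url])
        flag = "update?"
    lastsize[url] = size

    print url, flag
}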
Best, Daniel
[1] https://wikitech.wikimedia.org/wiki/Cache_log_format
On 25/02/16 21:14, Daniel Berger wrote:
> Nuria, thank you for pointing out that exporting a save flag for each request will be complicated. I wasn't aware of that.
> It would be very interesting to learn how the previous data set's save flag was exported back in 2007.
As I suspected in my offlist post, the save flag was set using the HTTP response code. Here are the files as they were when they were first committed to version control in 2012. I think they were the same in 2007 except for the IP address filter:
vu.awk:
function savemark(url, code) {
    if (url ~ /action=submit$/ && code == "TCP_MISS/302")
        return "save"
    return "-"
}

$5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
    print $3, $9, savemark($9, $6)
}
urjc.awk:
function savemark(url, code) {
    if (url ~ /action=submit$/ && code == "TCP_MISS/302")
        return "save"
    return "-"
}

$5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
    print $3, $9, savemark($9, $6), $4, $8
}
-- Tim Starling
Tim, thanks a lot. Your scripts show that we can get everything from the cache log format.
What is the current sampling rate for the cache logs in /a/log/webrequest/archive? I understand that the rates documented on wikitech,
- 1:1000 for the general request stream [1], and
- 1:100 for the mobile request stream [2],
might be outdated?
The 2007 trace had a 1:10 sampling rate, which means much more data. Would 1:10 still be feasible today?
A high sampling rate would be important to reproduce the cache hit ratio as seen by the varnish caches. However, this depends on how the caches are load balanced. If requests get distributed round robin (and there are many caches), then a 1:100 sampling rate would probably be enough to reproduce their hit rate. If requests get distributed by hashing over URLs (or similar), then we might need a higher sampling rate (like 1:10) to capture the request stream's temporal locality.
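As a concrete example of what I mean by reproducing the hit ratio, here is a rough sketch that computes a per-host hit ratio from a sampled trace. It assumes the cache hostname is in $1 and the cache request status in $6 (matching the field numbers below), and that squid-style statuses beginning with TCP_HIT, TCP_MEM_HIT or TCP_IMS_HIT count as hits; both the field positions and the status values are assumptions on my part:

# hitratio.awk -- sketch: per-cache-host hit ratio from a sampled trace.
# Assumes hostname in $1 and cache request status in $6, and that statuses
# starting with TCP_HIT / TCP_MEM_HIT / TCP_IMS_HIT are hits (assumptions).
{
    host = $1
    total[host]++
    if ($6 ~ /^TCP_(MEM_|IMS_)?HIT/)
        hits[host]++
}
END {
    for (h in total)
        printf "%s hit ratio: %.3f (%d/%d)\n", h, hits[h] / total[h], hits[h], total[h]
}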
Starting from the fields of the 2007 trace, it would be important to include
- the request size ($7)
and it would be helpful to include
- the cache hostname ($1)
- the cache request status ($6)
Building on your awk script, this would be something along the lines of:
function savemark(url, code) {
    if (url ~ /action=submit$/ && code == "TCP_MISS/302")
        return "save"
    return "-"
}

$5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
    print $1, $3, $4, $9, $7, savemark($9, $6), $6
}
Would this be an acceptable format?
Let me know your thoughts.
Thanks a lot, Daniel
[1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Daniel,
I took a second look at our dataset (FYI, we have not used sampled logs for this type of data for a while now) and hey, cache_status, cache_host and response size are right there. So, my mistake when I thought those were not included.
See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
So the only thing not available is request_size. No awk is needed, as this data is available in Hive for the last month. Take a look at the docs and let us know.
Thanks,
Nuria
Thank you, Nuria, for pointing me to the right doc. This looks great!
Do I correctly understand that we can compile a trace with all requests (or with a high sampling rate like 1:10) from the 'refined' webrequest data?
We can go without the request size. The following fields would be important:
- ts              timestamp in ms (to save bytes)
- uri_host
- uri_path
- uri_query       needed for save flag
- cache_status    needed for save flag
- http_method     needed for save flag
- response_size
(a sketch of how the save flag could be derived from these fields follows the next list)
Additionally, it would be interesting to have:
- hostname        to study cache load balancing
- sequence        to uniquely order requests below ms resolution
- content_type    to study hit rates per content type
- access_method   to study hit rates per access type
- time_firstbyte  for performance/latency comparison
- x_cache         more cache statistics (cache hierarchy)
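For illustration, here is a rough sketch of how the save flag could be rebuilt from those fields, mirroring Tim's 2007 rule. It assumes a hypothetical tab-separated export with the columns in the order of the first list above; Tim's rule additionally required a cache miss with a 302 response, which I leave out here because I don't know the exact cache_status values in the refined table, and the POST check is my own guess:

# saveflag-refined.awk -- hypothetical sketch, not the actual export format.
# Assumed tab-separated columns:
#   1 ts   2 uri_host   3 uri_path   4 uri_query
#   5 cache_status   6 http_method   7 response_size
BEGIN { FS = OFS = "\t" }
{
    flag = "-"
    # action=submit in the query string marks an edit; POST is a guess for
    # newer edit paths. Tim's original rule also required "TCP_MISS/302".
    if ($4 ~ /action=submit/ && $6 == "POST")
        flag = "save"
    print $1, $2 $3, $7, flag    # timestamp, host+path, response size, flag
}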
How do we proceed from here?
I guess it would make sense to first look at a tiny data set to verify we have what we need. I'm thinking of a few tens of requests.
Thanks a lot for your time! Daniel
> How do we proceed from here?
You can open a Phabricator item, explain your request, and tag it with the "analytics" tag; it will then go into our backlog.
Phabricator: https://phabricator.wikimedia.org
Our backlog: https://phabricator.wikimedia.org/tag/analytics/
What we are currently working on: https://phabricator.wikimedia.org/tag/analytics-kanban/
Our team focuses on infrastructure for analytics rather than compiling ad-hoc datasets. Since most requests are about edit or pageview data, they are normally covered either by existing datasets, by collaborations with the research team, or by analysts working for other teams in the organization. Now, we understand this data request does not fit either of those, which is why I am suggesting putting it in our backlog so our team can look at it.
Thanks,
Nuria
Alright, the corresponding task can be found here: https://phabricator.wikimedia.org/T128132
Thanks a lot for your help Nuria and Tim! Daniel
Following up on this thread to keep the archives happy: the dataset has been compiled and is available here: https://datasets.wikimedia.org/public-datasets/analytics/caching/
If you are interested in the caching data, please read the ticket, as there are data nuances you should know about: https://phabricator.wikimedia.org/T128132