Following up on this thread to keep the archives happy. The dataset has been compiled and is available here: https://datasets.wikimedia.org/public-datasets/analytics/caching/

If you are interested in caching data, please read the ticket, as there are data nuances you should know about: https://phabricator.wikimedia.org/T128132


On Thu, Feb 25, 2016 at 1:43 PM, Daniel Berger <berger@cs.uni-kl.de> wrote:
Alright, the corresponding task can be found here:
https://phabricator.wikimedia.org/T128132

Thanks a lot for your help Nuria and Tim!
Daniel


On 02/25/2016 09:58 PM, Nuria Ruiz wrote:
>>How do we proceed from here?
> You can open a Phabricator item, explain your request, and tag it with
> the "analytics" tag; it will then go into our backlog.
>
> Phabricator: https://phabricator.wikimedia.org
>
> Our backlog: https://phabricator.wikimedia.org/tag/analytics/
>
> What we are currently working
> on: https://phabricator.wikimedia.org/tag/analytics-kanban/
>
> Our team focuses on infrastructure for analytics rather than on
> compiling "ad-hoc" datasets. Since most requests are about edit or
> pageview data, they are normally handled either by existing datasets,
> by collaborations with the research team, or by analysts working for
> other teams in the organization. Now, we understand this data request
> does not fit any of those, which is why I am suggesting putting it on
> our backlog; our team will then look at it.
>
> Thanks,
>
> Nuria
>
>
>
>
> On Thu, Feb 25, 2016 at 12:42 PM, Daniel Berger <berger@cs.uni-kl.de> wrote:
>
>     Thank you, Nuria, for pointing me to the right doc. This looks great!
>
>     Do I correctly understand that we can compile a trace with all requests
>     (or with a high sampling rate like 1:10) from the 'refined' webrequest data?
>
>     We can go without request size. The following fields would be important
>     (a quick hit-ratio sanity check using them is sketched after the lists):
>     - ts           timestamp in ms (to save bytes)
>     - uri_host
>     - uri_path
>     - uri_query     needed for save flag
>     - cache_status  needed for save flag
>     - http_method   needed for save flag
>     - response_size
>
>     Additionally, it would be interesting to have
>     - hostname      to study cache load balancing
>     - sequence      to uniquely order requests within the same ms
>     - content_type  to study hit rates per content type
>     - access_method   to study hit rates per access type
>     - time_firstbyte  for performance/latency comparison
>     - x_cache       more cache statistics (cache hierarchy)
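>
>     For illustration (just a sketch on my side): assuming the trace comes
>     out as tab-separated lines in roughly the column order above, with
>     cache_status in column 5 and hit statuses containing the string "hit",
>     the varnish hit ratio could be sanity-checked with a short awk script:
>
>      # hitratio.awk -- overall hit ratio from a TSV trace
>      # assumption: cache_status is field 5; hits match /hit/
>      BEGIN { FS = "\t" }
>      $5 ~ /hit/ { hits++ }
>                 { total++ }
>      END { printf "hit ratio: %.4f\n", hits / total }
>
>      e.g.: awk -f hitratio.awk trace.tsv   (filenames hypothetical)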
>
>
>     How do we proceed from here?
>
>     I guess it would make sense to first look at a tiny data set to verify
>     we have what we need. I'm thinking about a few tens of requests?
>
>
>     Thanks a lot for your time!
>     Daniel
>
>
>
>
>     On 02/25/2016 05:55 PM, Nuria Ruiz wrote:
>     > Daniel,
>     >
>     > Took a second look at our dataset (FYI, we have not used sampled logs
>     > for a while now for this type of data) and hey, cache_status, cache_host
>     > and response size are right there. So, my mistake when I thought those
>     > were not included.
>     >
>     > See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
>     >
>     > So the only thing not available is request_size. No awk is needed, as
>     > this data is available in Hive for the last month. Take a look at the
>     > docs and let us know.
>     >
>     > Thanks,
>     >
>     > Nuria
>     >
>     >
>     >
>     > On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <berger@cs.uni-kl.de> wrote:
>     >
>     >     Tim, thanks a lot. Your scripts show that we can get everything
>     >     from the cache log format.
>     >
>     >
>     >     What is the current sampling rate for the cache logs in
>     >     /a/log/webrequest/archive?
>     >     I understand that the information on the wikitech wiki,
>     >      - 1:1000 for the general request stream [1], and
>     >      - 1:100 for the mobile request stream [2],
>     >     might be outdated?
>     >
>     >     The 2007 trace had a 1:10 sampling rate, which means much more data.
>     >     Would 1:10 still be feasible today?
>     >
>     >     A high sampling rate would be important to reproduce the cache hit
>     >     ratio as seen by the varnish caches. However, this depends on how
>     >     the caches are load balanced.
>     >     If requests get distributed round robin (and there are many caches),
>     >     then a 1:100 sampling rate would probably be enough to reproduce
>     >     their hit rate.
>     >     If requests get distributed by hashing over URLs (or similar), then
>     >     we might need a higher sampling rate (like 1:10) to capture the
>     >     request stream's temporal locality.
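>     >
>     >     For reference, drawing a 1:10 sample is a one-line awk filter; the
>     >     random variant below avoids the periodicity of taking every 10th
>     >     line (a sketch only, filenames hypothetical):
>     >
>     >      # sample-10.awk -- keep each line independently with prob. 1/10
>     >      BEGIN { srand() }
>     >      rand() < 0.1
>     >
>     >      e.g.: awk -f sample-10.awk full.log > sampled-10.log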
>     >
>     >
>     >     Starting from the fields of the 2007 trace, it would be important
>     >     to include
>     >      - the request size $7
>     >     and it would be helpful to include
>     >      - the cache hostname $1
>     >      - the cache request status $6
>     >
>     >     Building on your awk script, this would be something along the
>     >     lines of:
>     >
>     >      # mark page saves: an edit submission (action=submit) answered
>     >      # with TCP_MISS/302, the redirect issued after a successful save
>     >      function savemark(url, code) {
>     >         if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>     >             return "save"
>     >         return "-"
>     >      }
>     >
>     >      # skip requests whose IP (field $5) falls in the listed ranges,
>     >      # then print the selected fields plus the derived save flag
>     >      $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     >         print $1, $3, $4, $9, $7, savemark($9, $6), $6
>     >      }
>     >
>     >
>     >     Would this be an acceptable format?
>     >
>     >     Let me know your thoughts.
>     >
>     >
>     >     Thanks a lot,
>     >     Daniel
>     >
>     >
>     >     [1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
>     >     [2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
>     >
>     >
>     >
>     >
>     >
>     >
>     >     On 02/25/2016 12:04 PM, Tim Starling wrote:
>     >     > On 25/02/16 21:14, Daniel Berger wrote:
>     >     >> Nuria, thank you for pointing out that exporting a save flag for
>     >     >> each request will be complicated. I wasn't aware of that.
>     >     >>
>     >     >> It would be very interesting to learn how the previous data set's
>     >     >> save flag was exported back in 2007.
>     >     >
>     >     > As I suspected in my offlist post, the save flag was set using
>     >     > the HTTP response code. Here are the files as they were when they
>     >     > were first committed to version control in 2012. I think they were
>     >     > the same in 2007 except for the IP address filter:
>     >     >
>     >     > vu.awk:
>     >     >
>     >     > function savemark(url, code) {
>     >     >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>     >     >         return "save"
>     >     >     return "-"
>     >     > }
>     >     >
>     >     > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     >     >     print $3, $9, savemark($9, $6)
>     >     > }
>     >     >
>     >     >
>     >     > urjc.awk:
>     >     >
>     >     > function savemark(url, code) {
>     >     >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>     >     >         return "save"
>     >     >     return "-"
>     >     > }
>     >     >
>     >     > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     >     >     print $3, $9, savemark($9, $6), $4, $8
>     >     > }
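>     >     >
>     >     > (A note for anyone replaying these: both scripts expect
>     >     > whitespace-separated squid-style log lines, with $9 the URL and
>     >     > $6 the cache/HTTP status, e.g.
>     >     >
>     >     >     awk -f vu.awk sampled-1000.log > trace-vu.txt
>     >     >
>     >     > where the file names are hypothetical.)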
>     >     >
>     >     >
>     >     > -- Tim Starling
>     >     >
>     >

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics