Alright, the corresponding task can be found here:
https://phabricator.wikimedia.org/T128132
Thanks a lot for your help Nuria and Tim!
Daniel
On 02/25/2016 09:58 PM, Nuria Ruiz wrote:
>> How do we proceed from here?
> You can open a phabricator item, explain your request, tag it with
> "analytics" tag and it will go in our backlog.
>
> Phabricator: https://phabricator.wikimedia.org
>
> Our backlog: https://phabricator.wikimedia.org/tag/analytics/
>
> What we are currently working
> on: https://phabricator.wikimedia.org/tag/analytics-kanban/
>
> Our team focuses on infrastructure for analytics rather than compiling
> "ad-hoc" datasets. Since most requests are about edit or pageview data,
> those are normally covered either by existing datasets, by
> collaborations with the research team, or by analysts working for other
> teams in the organization. We understand this data request does not fit
> any of those cases, which is why I am suggesting you put it on our
> backlog so our team can look at it.
>
> Thanks,
>
> Nuria
>
>
>
>
> On Thu, Feb 25, 2016 at 12:42 PM, Daniel Berger <berger@cs.uni-kl.de> wrote:
>
> Thank you, Nuria, for pointing me to the right doc. This looks great!
>
> Do I understand correctly that we can compile a trace with all requests
> (or with a high sampling rate like 1:10) from the 'refined' webrequest
> data?
>
> We can go without request size. The following fields would be important:
> - ts (timestamp in ms, to save bytes)
> - uri_host
> - uri_path
> - uri_query (needed for the save flag; see the sketch below)
> - cache_status (needed for the save flag)
> - http_method (needed for the save flag)
> - response_size
>
> Additionally, it would be interesting to have:
> - hostname (to study cache load balancing)
> - sequence (to uniquely order requests below ms resolution)
> - content_type (to study hit rates per content type)
> - access_method (to study hit rates per access type)
> - time_firstbyte (for performance/latency comparisons)
> - x_cache (more cache statistics: cache hierarchy)
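>
> To make the save-flag derivation concrete, here is a minimal awk
> sketch, assuming a hypothetical tab-separated export with the seven
> fields in the order listed above (column positions and value formats
> are assumptions, not the actual Hive schema):
>
>     # hypothetical TSV columns: ts, uri_host, uri_path, uri_query,
>     # cache_status, http_method, response_size
>     BEGIN { FS = OFS = "\t" }
>     {
>         save = "-"
>         # an edit save shows up as action=submit in the query string;
>         # requiring a POST that missed the cache carries over the
>         # TCP_MISS/302 logic of the 2007 savemark() quoted below
>         if ($4 ~ /action=submit/ && $6 == "POST" && $5 ~ /miss/)
>             save = "save"
>         print $0, save
>     }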
>
>
> How do we proceed from here?
>
> I guess it would make sense to first look at a tiny data set to verify
> we have what we need. I'm thinking of a few tens of requests?
>
>
> Thanks a lot for your time!
> Daniel
>
>
>
>
> On 02/25/2016 05:55 PM, Nuria Ruiz wrote:
> > Daniel,
> >
> > I took a second look at our dataset (FYI, we have not used sampled
> > logs for this type of data for a while now) and hey, cache_status,
> > cache_host and response_size are right there. So, my mistake in
> > thinking those were not included.
> >
> > See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
> >
> > So the only thing not available is request_size. No awk is needed, as
> > this data is available in Hive for the last month. Take a look at the
> > docs and let us know.
> >
> > Thanks,
> >
> > Nuria
> >
> >
> >
> > On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <berger@cs.uni-kl.de> wrote:
> >
> > Tim, thanks a lot. Your scripts show that we can get everything from
> > the cache log format.
> >
> >
> > What is the current sampling rate for the cache logs in
> > /a/log/webrequest/archive?
> > I understand that the wikitech wiki's figures of
> > - 1:1000 for the general request stream [1], and
> > - 1:100 for the mobile request stream [2]
> > might be outdated?
> >
> > The 2007 trace had a 1:10 sampling rate, which means much more data.
> > Would 1:10 still be feasible today?
> >
> > A high sampling rate would be important to reproduce the cache hit
> > ratio as seen by the varnish caches. However, this depends on how the
> > caches are load balanced.
> > If requests get distributed round robin (and there are many caches),
> > then a 1:100 sampling rate would probably be enough to reproduce their
> > hit rate.
> > If requests get distributed by hashing over URLs (or similar), then we
> > might need a higher sampling rate (like 1:10) to capture the request
> > stream's temporal locality.
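> >
> > As a rough illustration, a 1:100 systematic sample could be drawn
> > from a raw log with a one-line awk filter (the production samplers
> > presumably select randomly rather than by line count, but for
> > hit-ratio estimates the effect should be comparable):
> >
> >     # keep every 100th request; a random 1:100 sample would test
> >     # rand() < 0.01 (after srand()) instead of the line counter
> >     NR % 100 == 0 { print }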
> >
> >
> > Starting from the fields of the 2007 trace, it would be important to
> > include
> > - the request size ($7)
> > and it would be helpful to include
> > - the cache hostname ($1)
> > - the cache request status ($6)
> >
> > Building on your awk script, this would be something along the
> > lines of:
> >
> >     function savemark(url, code) {
> >         # a successful edit shows up as an action=submit URL answered
> >         # with a 302 redirect that missed the cache
> >         if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> >             return "save"
> >         return "-"
> >     }
> >
> >     # $5 is presumably the client IP; the filter drops known internal
> >     # addresses, as in the 2007 scripts
> >     $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> >         # per the thread: $1 cache hostname, $6 cache status, $7
> >         # request size, $9 URL; $3 and $4 presumably timestamp fields
> >         print $1, $3, $4, $9, $7, savemark($9, $6), $6
> >     }
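> >
> > For a first sanity check, this could be run over a small sample along
> > the lines of (script and log file names are hypothetical):
> >
> >     awk -f wmtrace.awk sampled-requests.log | head -50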
> >
> >
> > Would this be an acceptable format?
> >
> > Let me know your thoughts.
> >
> >
> > Thanks a lot,
> > Daniel
> >
> >
> > [1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
> >
> > [2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
> >
> >
> >
> >
> >
> >
> > On 02/25/2016 12:04 PM, Tim Starling wrote:
> > > On 25/02/16 21:14, Daniel Berger wrote:
> > >> Nuria, thank you for pointing out that exporting a save flag for
> > >> each request will be complicated. I wasn't aware of that.
> > >>
> > >> It would be very interesting to learn how the previous data set's
> > >> save flag was exported back in 2007.
> > >
> > > As I suspected in my offlist post, the save flag was set using the
> > > HTTP response code. Here are the files as they were when they were
> > > first committed to version control in 2012. I think they were the
> > > same in 2007 except for the IP address filter:
> > >
> > > vu.awk:
> > >
> > >     function savemark(url, code) {
> > >         if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > >             return "save"
> > >         return "-"
> > >     }
> > >
> > >     $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > >         print $3, $9, savemark($9, $6)
> > >     }
> > >
> > >
> > > urjc.awk:
> > >
> > >     function savemark(url, code) {
> > >         if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > >             return "save"
> > >         return "-"
> > >     }
> > >
> > >     $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > >         print $3, $9, savemark($9, $6), $4, $8
> > >     }
> > >
> > >
> > > -- Tim Starling
> > >
> >
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics