Following up on this thread to keep the archives happy. The dataset has been compiled and is available here: https://datasets.wikimedia.org/public-datasets/analytics/caching/
If you are interested in the caching data, please read the ticket, as there are data nuances you should know about: https://phabricator.wikimedia.org/T128132
On Thu, Feb 25, 2016 at 1:43 PM, Daniel Berger <berger@cs.uni-kl.de> wrote:
Alright, the corresponding task can be found here: https://phabricator.wikimedia.org/T128132
Thanks a lot for your help, Nuria and Tim!
Daniel
On 02/25/2016 09:58 PM, Nuria Ruiz wrote:
> How do we proceed from here?
You can open a Phabricator item, explain your request, and tag it with the "analytics" tag; it will go into our backlog.
Phabricator: https://phabricator.wikimedia.org
Our backlog: https://phabricator.wikimedia.org/tag/analytics/
What we are currently working on: https://phabricator.wikimedia.org/tag/analytics-kanban/
Our team focuses on infrastructure for analytics rather than on compiling "ad-hoc" datasets. Since most requests are about edit or pageview data, they are normally handled either by existing datasets, by collaborations with the research team, or by analysts working for other teams in the organization. Now, we understand this data request does not fit into any of those, which is why I am suggesting you put it on our backlog; our team will look at it from there.
Thanks,
Nuria
On Thu, Feb 25, 2016 at 12:42 PM, Daniel Berger <berger@cs.uni-kl.de> wrote:
Thank you, Nuria, for pointing me to the right doc. This looks great! Do I correctly understand that we can compile a trace with all requests (or with a high sampling rate like 1:10) from the 'refined' webrequest data?

We can go without request size. The following fields would be important:
- ts (timestamp in ms, to save bytes)
- uri_host
- uri_path
- uri_query (needed for save flag)
- cache_status (needed for save flag)
- http_method (needed for save flag)
- response_size

Additionally, it would be interesting to have:
- hostname (to study cache load balancing)
- sequence (to uniquely order requests below ms resolution)
- content_type (to study hit rates per content type)
- access_method (to study hit rates per access type)
- time_firstbyte (for performance/latency comparison)
- x_cache (more cache statistics, cache hierarchy)

How do we proceed from here? I guess it would make sense to first look at a tiny data set to verify we have what we need. I'm thinking about a few tens of requests?

Thanks a lot for your time!
Daniel
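For the tiny verification sample, a minimal awk sketch of the kind of check the field list above enables, assuming a tab-separated export with the columns in the order listed (1=ts, 2=uri_host, 3=uri_path, 4=uri_query, 5=cache_status, 6=http_method, 7=response_size); the save-flag condition and the cache_status values counted as hits are assumptions to be confirmed against the Webrequest docs:

# trace-check.awk -- sanity-check a small exported sample: derive a save
# flag (POSTs with action=submit, in the spirit of the 2007 scripts quoted
# further down in this thread) and tally a rough hit ratio from cache_status.
BEGIN { FS = OFS = "\t" }

function savemark(query, method) {
    if (query ~ /action=submit/ && method == "POST")
        return "save"
    return "-"
}

{
    total++
    if ($5 ~ /hit/)       # which cache_status values count as hits: TBD
        hits++
    print $1, $2 $3, savemark($4, $6), $7
}

END {
    if (total > 0)
        printf "requests: %d, hit ratio: %.3f\n", total, hits / total > "/dev/stderr"
}

Running awk -f trace-check.awk tiny-sample.tsv over a few tens of requests would be enough to confirm the fields line up before committing to a full export.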
On 02/25/2016 05:55 PM, Nuria Ruiz wrote:
> Daniel,
>
> Took a second look at our dataset (FYI, we have not used sampled logs
> for a while now for this type of data) and hey, cache_status, cache_host
> and response size are right there. So, my mistake when I thought those
> were not included.
>
> See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
>
> So the only thing not available is request_size. No awk is needed, as
> this data is available on hive for the last month. Take a look at the docs
> and let us know.
>
> Thanks,
>
> Nuria
>
> On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <berger@cs.uni-kl.de> wrote:
>
> Tim, thanks a lot. Your scripts show that we can get everything from the
> cache log format.
>
> What is the current sampling rate for the cache logs in
> /a/log/webrequest/archive?
> I understand that wikitech's wiki information
> - 1:1000 for the general request stream [1], and
> - 1:100 for the mobile request stream [2]
> might be outdated?
>
> The 2007 trace had a 1:10 sampling rate, which means much more data.
> Would 1:10 still be feasible today?
>
> A high sampling rate would be important to reproduce the cache hit ratio
> as seen by the varnish caches. However, this depends on how the caches
> are load balanced.
> If requests get distributed round robin (and there are many caches),
> then a 1:100 sampling rate would probably be enough to reproduce their
> hit rate.
> If requests get distributed by hashing over URLs (or similar), then we
> might need a higher sampling rate (like 1:10) to capture the request
> stream's temporal locality.
>
> Starting from the fields of the 2007 trace, it would be important to include
> - the request size $7
> and it would be helpful to include
> - the cache hostname $1
> - the cache request status $6
>
> Building on your awk script, this would be something along these lines:
>
> function savemark(url, code) {
>     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>         return "save"
>     return "-"
> }
>
> $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     print $1, $3, $4, $9, $7, savemark($9, $6), $6
> }
>
> Would this be an acceptable format?
>
> Let me know your thoughts.
>
> Thanks a lot,
> Daniel
>
> [1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
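For the hashing concern raised just above, one common alternative to thinning the stream per request is to sample by URL: keep every request whose URL hashes into a 1-in-N bucket, so per-object request sequences, and thus temporal locality, survive the sampling. A rough awk sketch under that assumption; the tab-separated input, the URL sitting in field 2, and the hash function itself are illustrative choices, not part of any proposed export format:

# sample-by-url.awk -- keep all requests for roughly 1 in RATE of the URLs,
# rather than 1 in RATE of the individual requests, so each retained URL
# keeps its full request sequence.
BEGIN {
    FS = "\t"
    if (RATE == "") RATE = 10
    # POSIX awk has no ord(); build a character -> code lookup table
    # (URLs are printable ASCII, so 32-126 is enough).
    for (i = 32; i < 127; i++)
        ord[sprintf("%c", i)] = i
}

function urlhash(s,    i, h) {
    h = 0
    for (i = 1; i <= length(s); i++)
        h = (h * 31 + ord[substr(s, i, 1)]) % 2147483647
    return h
}

# field 2 is assumed to hold the full URL (uri_host + uri_path + uri_query)
urlhash($2) % RATE == 0

Run as awk -v RATE=10 -f sample-by-url.awk trace.tsv > sampled.tsv; hit ratios measured on the output are then per-URL-bucket estimates of the full stream's behaviour.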
> On 02/25/2016 12:04 PM, Tim Starling wrote:
> > On 25/02/16 21:14, Daniel Berger wrote:
> >> Nuria, thank you for pointing out that exporting a save flag for each
> >> request will be complicated. I wasn't aware of that.
> >>
> >> It would be very interesting to learn how the previous data set's save
> >> flag was exported back in 2007.
> >
> > As I suspected in my offlist post, the save flag was set using the
> > HTTP response code. Here are the files as they were when they were
> > first committed to version control in 2012. I think they were the same
> > in 2007 except for the IP address filter:
> >
> > vu.awk:
> >
> > function savemark(url, code) {
> >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> >         return "save"
> >     return "-"
> > }
> >
> > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> >     print $3, $9, savemark($9, $6)
> > }
> >
> > urjc.awk:
> >
> > function savemark(url, code) {
> >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> >         return "save"
> >     return "-"
> > }
> >
> > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> >     print $3, $9, savemark($9, $6), $4, $8
> > }
> >
> > -- Tim Starling
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics