Hi guys!
So, we've had a Todo on our list for a while now to make a couple of tweaks to the web access log format coming from squid, varnish and nginx.
1. Append Accept-Language and X-Carrier headers. This brings the field count from 14 up to 16. udp-filter has already been modified to handle this. I've already got a change in for this: https://gerrit.wikimedia.org/r/#/c/12188/
2. Change field separator from space to tab. User-Agent and Content-Type headers (and possibly others) sometimes contain spaces. Some sources (e.g. varnish) properly URL encode the fields before they are sent out, but others don't. Using tab as the field separator in web access logs will avoid many of these issues.
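To illustrate the separator problem in item 2, here is a minimal Python sketch. The 16 field values and their order are hypothetical, chosen only to show the effect. They are not the actual squid/varnish/nginx log layout, and `split_record` is a helper invented for this example.

```python
# Hypothetical 16-field record. Values and field order are illustrative
# assumptions, not the real access log layout.
fields = [
    "cp1001.example",                          # serving host
    "12345",                                   # sequence number
    "2012-11-13T12:10:00",                     # timestamp
    "0.002",                                   # request time
    "192.0.2.1",                               # client IP
    "TCP_MISS/200",                            # cache status / HTTP status
    "5120",                                    # response size
    "GET",                                     # request method
    "http://en.wikipedia.org/wiki/Main_Page",  # URL
    "NONE/-",                                  # peer status
    "text/html; charset=UTF-8",                # Content-Type (contains a space)
    "-",                                       # referer
    "-",                                       # X-Forwarded-For
    "Mozilla/5.0 (X11; Linux x86_64)",         # User-Agent (contains spaces)
    "en-US,en;q=0.8",                          # Accept-Language (new)
    "-",                                       # X-Carrier (new)
]


def split_record(line, sep):
    """Split one log line on the given separator."""
    return line.rstrip("\n").split(sep)


space_parts = split_record(" ".join(fields), " ")
tab_parts = split_record("\t".join(fields), "\t")

print(len(space_parts))  # 20: unencoded spaces inside values add extra columns
print(len(tab_parts))    # 16: one column per field
```

With the space separator, a consumer that expects a fixed column count reads the wrong values for every field after the first one containing a space. With tabs the split round-trips, as long as no field itself contains a tab.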
We have wanted to do this for a while, but haven't because we were worried about breaking Erik Zachte's wikistats scripts. Stefan Petrea is now working with Diederik on wikistats (and other things), and has dealt with this issue. So! We are ready! We'd like to make this change before we start real consumption of the web access logs into the Kraken cluster, which hopefully will be relatively soon.
Would these changes cause Fundraising any foreseeable problems? Can we go ahead and work with ops to push this through?
Thanks! -Andrew Otto
Hi Andrew,
We are almost completely sure that this set of changes would cause a significant amount of havoc with various crucial systems we have in place, and we definitely don't have time to shake bugs out of those systems at this point, as they are all already in heavy use. The only good time to deploy those changes in the foreseeable future would be (approximately) January.
Sorry about the bad news, -Katie
On Tue, Nov 13, 2012 at 12:10 PM, Andrew Otto otto@wikimedia.org wrote:
Hi guys!
So, we've had a Todo on our list for a while now to make a couple of tweaks to the web access log format coming from squid, varnish and nginx.
- Append Accept-Language and X-Carrier headers.
This brings the field count from 14 up to 16. udp-filter has already been modified to handle this. I've already got a change in for this: https://gerrit.wikimedia.org/r/#/c/12188/
- Change field separator from space to tab.
User-Agent and Content-Type headers (and possibly others) sometimes contain spaces. Some sources (e.g. varnish) properly URL encode the fields before they are sent out, but others don't. Using tab as the field separator in web access logs will avoid many of these issues.
We have wanted to do this for a while, but haven't because we were worried about breaking Erik Zachte's wikistats scripts. Stefan Petrea is now working with Diederik on wikistats (and other things), and has dealt with this issue. So! We are ready! We'd like to make this change before we start real consumption of the web access logs into the Kraken cluster, which hopefully will be relatively soon.
Would these changes cause Fundraising any foreseeable problems? Can we go ahead and work with ops to push this through?
Thanks! -Andrew Otto
Hi Katie, Could you please give a bit more detail about the "significant amount of havoc with various crucial systems we have in place"? Thanks! Diederik
On Tue, Nov 13, 2012 at 3:58 PM, Katie Horn khorn@wikimedia.org wrote:
Hi Andrew,
We are almost completely sure that this set of changes would cause a significant amount of havoc with various crucial systems we have in place, and we definitely don't have time to shake bugs out of those systems at this point, as they are all already in heavy use. The only good time to deploy those changes in the foreseeable future would be (approximately) January.
Sorry about the bad news, -Katie
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
We have a number of things that process the UDP logs for analytics on banners and landing pages. I don't think we are talking about weeks of effort to fix them, but we don't even have days at this point. Right now, everything is working to the best extent that we can make it. I believe the "havoc" that Katie is speaking about is changing things around 13 days before the full-on launch of the fundraiser, especially when next week is Thanksgiving. With the possibility of little bugs in half a dozen scripts, that could be a massive headache.
On Tue, Nov 13, 2012 at 1:35 PM, Diederik van Liere <dvanliere@wikimedia.org> wrote:
Hi Katie, Could you please give a bit more detail about the "significant amount of havoc with various crucial systems we have in place"? Thanks! Diederik
Right, let's schedule this for early January, once the fundraiser is finished. D
On Tue, Nov 13, 2012 at 4:40 PM, Peter Gehres pgehres@wikimedia.org wrote:
We have a number of things that process the UDP logs for analytics on banners and landing pages. I don't think we are talking about weeks of effort to fix them, but we don't even have days at this point. Right now, everything is working to the best extent that we can make it. I believe the "havoc" that Katie is speaking about is changing things around 13 days before the full-on launch of the fundraiser, especially when next week is Thanksgiving. With the possibility of little bugs in half a dozen scripts, that could be a massive headache.
--
Peter Gehres
Fundraiser Production Manager Wikimedia Foundation
Thanks Diederik. Sorry if this puts a wrench in some of your plans. As much as I like the idea of switching to \t and adding accept-language, we just don't have the capacity at the moment to update and then fully re-test everything we have.
Hopefully the fundraiser goes well enough for you guys to get that extra hardware you want :-)
On Tue, Nov 13, 2012 at 1:44 PM, Diederik van Liere <dvanliere@wikimedia.org> wrote:
Right, let's schedule this for early January, once the fundraiser is finished. D
--
Peter Gehres
Fundraiser Production Manager Wikimedia Foundation
Sorry for the dumb question, but I've followed this discussion for a while and I didn't manage to understand yet: this change won't affect the pageview stats format in any way, will it? (I assume not, but better safe than sorry.) http://dumps.wikimedia.org/other/pagecounts-raw/
Nemo
Hi Nemo,
No, this change will not impact the data on dumps.wikimedia.org. However, we do have some pending changes to dumps.wikimedia.org:
* adding new domain names (like blog, wikimediafoundation, planet, wikidata, wikivoyage)
Do you have any specific requests regarding dumps.wikimedia.org? best, Diederik
On Tue, Nov 13, 2012 at 6:42 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Sorry for the dumb question, but I've followed this discussion for a while and I didn't manage to understand yet: this change won't affect the pageview stats format in any way, will it? (I assume not, but better safe than sorry.) http://dumps.wikimedia.org/other/pagecounts-raw/
Nemo
Diederik van Liere, 14/11/2012 00:51:
Do you have any specific requests regarding dumps.wikimedia.org?
Only that Domas' pageview logs are kept functioning and without any format change. That's by far the most used stats tool in Wikimedia land. Of course more domains are also appreciated! But that's another story.
Nemo
The idea is to add extra metrics to the dataset in a way that doesn't break what is there.
Right now all HTML requests are counted equally. It would be very useful to have a more sanitized count that comes close to human page views, meaning all bot requests are excluded (user agent contains bot/spider/crawler/http). 404s might also be counted separately. But this would be extra data lines in the file with different codes, just like the mobile metrics that were added two years ago. To be vetted before implementation :-)
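As a rough sketch of the filtering idea above, in Python: the substring markers come from the description in this message, while the category labels, the function name `classify_request`, and the choice to test for bots before testing for 404s are assumptions made purely for illustration. None of this is the actual wikistats logic.

```python
from collections import Counter

# Substrings named in the message above; matching is case-insensitive.
BOT_MARKERS = ("bot", "spider", "crawler", "http")


def classify_request(user_agent, status):
    """Return a hypothetical category label for a single request."""
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return "bot"
    if status == 404:
        return "not_found"
    return "human"


# Made-up sample requests as (user agent, HTTP status) pairs.
sample = [
    ("Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20100101 Firefox/16.0", 200),
    ("Googlebot/2.1 (+http://www.google.com/bot.html)", 200),
    ("Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20100101 Firefox/16.0", 404),
]

counts = Counter(classify_request(ua, status) for ua, status in sample)
print(counts["human"], counts["bot"], counts["not_found"])  # 1 1 1
```

Substring matching like this is deliberately crude. The "http" marker, for example, catches any user agent that embeds a URL, which is one reason such a rule would need vetting against real traffic before it is used.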
Erik
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Federico Leva (Nemo)
Sent: Wednesday, November 14, 2012 1:18 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Cc: fr-tech@wikimedia.org
Subject: Re: [Analytics] Web Access Log Format Changes
Diederik van Liere, 14/11/2012 00:51:
Do you have any specific requests regarding dumps.wikimedia.org?
Only that Domas' pageview logs are kept functioning and without any format change. That's by far the most used stats tool in Wikimedia land. Of course more domains are also appreciated! But that's another story.
Nemo