Apologies for crossposting
Heya,
The Analytics Team is planning to deploy "tab as field delimiter" to replace the current space as fielddelimiter on the varnish/squid/nginx servers. We would like to do this on February 1st. The reason for this change is that we need to have a consistent number of fields in each webrequest log line. Right now, some fields contain spaces and that require a lot of post-processing cleanup and slows down the generation of reports.
What is affected and maintained by Analytics
* udp-filter already has support for the tab character * webstatscollector: we compiled a new version of filter to add support for the tab character * wikistats: we will fix the scripts on an ongoing basis. * udp2log: we have a patch ready for inserting sequence numbers separated by tab.
In particular, I would like to have feedback to three questions:
1) Are there important reasons not to use tab as field delimiter?
2) Are there important pieces of logging that expect a space instead of a tab and that need to be fixed and that I did not mention in this email?
3) Is February 1st a good date to deploy this change? (Assuming that all preps are finished)
Best,
Diederik
[dropping fr-2012 in bcc in exchange for fr-tech]
Fundraising should be able to make the switch fairly quickly. It should also eliminate of the the annoying little bugs as well. In actuality, our code shouldn't notice as a tab is whitespace. :-)
Thanks Diederik.
On Fri, Jan 25, 2013 at 9:41 AM, Diederik van Liere <dvanliere@wikimedia.org
wrote:
Apologies for crossposting
Heya,
The Analytics Team is planning to deploy "tab as field delimiter" to replace the current space as fielddelimiter on the varnish/squid/nginx servers. We would like to do this on February 1st. The reason for this change is that we need to have a consistent number of fields in each webrequest log line. Right now, some fields contain spaces and that require a lot of post-processing cleanup and slows down the generation of reports.
What is affected and maintained by Analytics
- udp-filter already has support for the tab character
- webstatscollector: we compiled a new version of filter to add support
for the tab character
- wikistats: we will fix the scripts on an ongoing basis.
- udp2log: we have a patch ready for inserting sequence numbers separated
by tab.
In particular, I would like to have feedback to three questions:
Are there important reasons not to use tab as field delimiter?
Are there important pieces of logging that expect a space instead of a
tab and that need to be fixed and that I did not mention in this email?
- Is February 1st a good date to deploy this change? (Assuming that all
preps are finished)
Best,
Diederik
On 01/25/2013 12:41 PM, Diederik van Liere wrote:
Apologies for crossposting
Heya,
The Analytics Team is planning to deploy "tab as field delimiter" to replace the current space as fielddelimiter on the varnish/squid/nginx servers.
I don't know either way if any other code is affected. But I'm wondering, what will happen if the input (e.g. a header value) contains a tab? Is it sanitized by Varnish/Squid/nginx?
Matt Flaschen
Sent from my iPhone
On 2013-01-26, at 2:22, Matthew Flaschen mflaschen@wikimedia.org wrote:
On 01/25/2013 12:41 PM, Diederik van Liere wrote:
Apologies for crossposting
Heya,
The Analytics Team is planning to deploy "tab as field delimiter" to replace the current space as fielddelimiter on the varnish/squid/nginx servers.
I don't know either way if any other code is affected. But I'm wondering, what will happen if the input (e.g. a header value) contains a tab? Is it sanitized by Varnish/Squid/nginx?
Hey Matt This is exactly the reason why we want to move from space to tab as field delimiter. Spaces occur frequently in a header value and it requires us to writes patches for squid/varnish/Nginx to escape the spaces. Writing these patches happened very infrequently and is cumbersome because you need to do it three times.
Now there could be a tab in a header value as well but I have never seen it in our logfiles and i also grepped for it on a couple of random files and found no such occurrences. So we are not going to escape tab characters in fields unless new information changes our mind.
Best D
Matt Flaschen
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Saturday, January 26, 2013 at 12:57 PM, Diederik van Liere wrote:
Now there could be a tab in a header value as well but I have never seen it in our logfiles and i also grepped for it on a couple of random files and found no such occurrences. So we are not going to escape tab characters in fields unless new information changes our mind.
It would be nice the eliminate this worry categorically. I checked, and it appears that varnishncsa and varnishlog do not escape tabs.
How I tested:
varnishd -a :10200 -b 173.194.79.104 -F
This will start a varnish instance on port 10200 that uses Google as a back-end. Then:
varnishncsa
And in another shell:
curl -I --user-agent QQQQ$'\t'ZZZZ http://127.0.0.1:10200
(You can also add a tab to the command line by typing Ctrl-V + TAB.)
The output of varnishncsa is:
127.0.0.1 - - [26/Jan/2013:18:12:28 -0800] "HEAD http://127.0.0.1:10200/ HTTP/1.1" 200 0 "-" "QQQQ ZZZZ"
So the tab is not escaped.
According to RFC 2616 (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2), tabs are permitted in HTTP headers. In particular, a leading tab can be used to construct multi-line header strings. I have no idea how common this is (I suspect it's pretty rare), but who knows.
I don't think means you have to ditch tabs -- I doubt there's a problem-free delimiter. But you should write code and configure software with the expectation that literal tabs will be encountered so that you can deal with it gracefully.
-- Ori Livneh
We had major data loss on two occasions because of misaligned fields. Broken records just fell through and went unnoticed, causing up to some 20% data loss for months. There was embarrassment all over. It took many days to repair the data as much as possible. So if we fix this, let us make it really robust, without causing too much overhead.
If we bet tabs will never be introduced in the input stream, even where they are legal, and we take that approach on several issues, one of those unlikely events will surely happen sooner than you think (these are not random mutations, somewhere someone follows the same logic of making another layer more robust, call it evolutionary convergence).
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
Scanning an input record for tabs and raising a counter is also very efficient. Sending one alert hourly based on this counter should make us aware soon enough when this issue needs follow-up, yet without causing bottle necks.
Erik
-----Original Message----- From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Ori Livneh Sent: Sunday, January 27, 2013 8:14 AM To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] RFC: Tab as field delimiter in logging format of cache servers
On Saturday, January 26, 2013 at 12:57 PM, Diederik van Liere wrote:
Now there could be a tab in a header value as well but I have never seen
it in our logfiles and i also grepped for it on a couple of random files and found no such occurrences. So we are not going to escape tab characters in fields unless new information changes our mind.
It would be nice the eliminate this worry categorically. I checked, and it appears that varnishncsa and varnishlog do not escape tabs.
How I tested:
varnishd -a :10200 -b 173.194.79.104 -F
This will start a varnish instance on port 10200 that uses Google as a back-end. Then:
varnishncsa
And in another shell:
curl -I --user-agent QQQQ$'\t'ZZZZ http://127.0.0.1:10200
(You can also add a tab to the command line by typing Ctrl-V + TAB.)
The output of varnishncsa is:
127.0.0.1 - - [26/Jan/2013:18:12:28 -0800] "HEAD http://127.0.0.1:10200/ HTTP/1.1" 200 0 "-" "QQQQ ZZZZ"
So the tab is not escaped.
According to RFC 2616 (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2), tabs are permitted in HTTP headers. In particular, a leading tab can be used to construct multi-line header strings. I have no idea how common this is (I suspect it's pretty rare), but who knows.
I don't think means you have to ditch tabs -- I doubt there's a problem-free delimiter. But you should write code and configure software with the expectation that literal tabs will be encountered so that you can deal with it gracefully.
-- Ori Livneh
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Sunday, January 27, 2013 at 5:07 AM, Erik Zachte wrote:
<snipped>
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
FWIW, this seems like the best idea to me.
-- Ori Livneh
On 01/27/2013 08:07 AM, Erik Zachte wrote:
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
Scanning an input record for tabs and raising a counter is also very efficient. Sending one alert hourly based on this counter should make us aware soon enough when this issue needs follow-up, yet without causing bottle necks.
Doing both of those would be pretty robust. However, if that isn't workable, a simple option is just to strip tab characters before Varnish/Squid/etc. writes the line.
That means downstream code doesn't have to do anything special, and it shouldn't affect many actual requests.
Matt Flaschen
Hi all!
Just an FYI here that this has been done, yay! Varnish, Nginx, and Squid frontends are now all logging with tab as the field delimiter.
For those who would notice, for the time being, we have started outputting logs to new filenames with .tab. in the name, so as to differentiate the format. We will most likely change the file names back to their original names in a month or so.
Thanks all! -Andrew Otto
On Jan 28, 2013, at 11:33 AM, Matthew Flaschen mflaschen@wikimedia.org wrote:
On 01/27/2013 08:07 AM, Erik Zachte wrote:
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
Scanning an input record for tabs and raising a counter is also very efficient. Sending one alert hourly based on this counter should make us aware soon enough when this issue needs follow-up, yet without causing bottle necks.
Doing both of those would be pretty robust. However, if that isn't workable, a simple option is just to strip tab characters before Varnish/Squid/etc. writes the line.
That means downstream code doesn't have to do anything special, and it shouldn't affect many actual requests.
Matt Flaschen
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
We will most likely change the file names back to their original names in a month or so
Please don't. It'll serve as a visible marker for the future for when we go back and look at the files and do a WTF.
~Matt Walker
Yes let's not change the filenames D
Sent from my iPhone
On 2013-01-31, at 18:45, Matthew Walker mwalker@wikimedia.org wrote:
We will most likely change the file names back to their original names in a month or so
Please don't. It'll serve as a visible marker for the future for when we go back and look at the files and do a WTF.
~Matt Walker _______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Ah, no I mean change the future ones back to their original names. We'd leave the ones that are being generated as '.tab.' now as they are. We could see these filenames in the archives.
But! If everybody loves '.tab.', forever, that's fine with me too!
On Jan 31, 2013, at 6:51 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Yes let's not change the filenames D
Sent from my iPhone
On 2013-01-31, at 18:45, Matthew Walker mwalker@wikimedia.org wrote:
We will most likely change the file names back to their original names in a month or so
Please don't. It'll serve as a visible marker for the future for when we go back and look at the files and do a WTF.
~Matt Walker _______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Thu, Jan 31, 2013 at 7:12 PM, Andrew Otto otto@wikimedia.org wrote:
Ah, no I mean change the future ones back to their original names. We'd leave the ones that are being generated as '.tab.' now as they are. We could see these filenames in the archives.
But! If everybody loves '.tab.', forever, that's fine with me too!
I really don't like .tab.
-Chad
.tsv - tab separated values?
On Thu, Jan 31, 2013 at 7:23 PM, Chad innocentkiller@gmail.com wrote:
On Thu, Jan 31, 2013 at 7:12 PM, Andrew Otto otto@wikimedia.org wrote:
Ah, no I mean change the future ones back to their original names. We'd
leave the ones that are being generated as '.tab.' now as they are. We could see these filenames in the archives.
But! If everybody loves '.tab.', forever, that's fine with me too!
I really don't like .tab.
-Chad
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
.tsv doh!
That I like! That woulda been so much cleaner and obvious and I wouldn't mind leaving it forever.
Hokay!
On Jan 31, 2013, at 7:25 PM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
.tsv - tab separated values?
On Thu, Jan 31, 2013 at 7:23 PM, Chad innocentkiller@gmail.com wrote:
On Thu, Jan 31, 2013 at 7:12 PM, Andrew Otto otto@wikimedia.org wrote:
Ah, no I mean change the future ones back to their original names. We'd
leave the ones that are being generated as '.tab.' now as they are. We could see these filenames in the archives.
But! If everybody loves '.tab.', forever, that's fine with me too!
I really don't like .tab.
-Chad
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hey Mark,
See my e-mail below and Erik Zachte's reply. I don't mind switching to tabs (I'm just making the change to the EventLogging config right now..) but I think we still need to escape whatever character we choose as delimiter in HTTP headers
-- Ori Livneh
Forwarded message:
From: Erik Zachte ezachte@wikimedia.org Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. analytics@lists.wikimedia.org To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. analytics@lists.wikimedia.org Date: Sunday, January 27, 2013 5:07:50 AM Subject: Re: [Analytics] RFC: Tab as field delimiter in logging format of cache servers
We had major data loss on two occasions because of misaligned fields. Broken records just fell through and went unnoticed, causing up to some 20% data loss for months. There was embarrassment all over. It took many days to repair the data as much as possible. So if we fix this, let us make it really robust, without causing too much overhead.
If we bet tabs will never be introduced in the input stream, even where they are legal, and we take that approach on several issues, one of those unlikely events will surely happen sooner than you think (these are not random mutations, somewhere someone follows the same logic of making another layer more robust, call it evolutionary convergence).
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
Scanning an input record for tabs and raising a counter is also very efficient. Sending one alert hourly based on this counter should make us aware soon enough when this issue needs follow-up, yet without causing bottle necks.
Erik
-----Original Message----- From: analytics-bounces@lists.wikimedia.org (mailto:analytics-bounces@lists.wikimedia.org) [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Ori Livneh Sent: Sunday, January 27, 2013 8:14 AM To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] RFC: Tab as field delimiter in logging format of cache servers
On Saturday, January 26, 2013 at 12:57 PM, Diederik van Liere wrote:
Now there could be a tab in a header value as well but I have never seen
it in our logfiles and i also grepped for it on a couple of random files and found no such occurrences. So we are not going to escape tab characters in fields unless new information changes our mind.
It would be nice the eliminate this worry categorically. I checked, and it appears that varnishncsa and varnishlog do not escape tabs.
How I tested:
varnishd -a :10200 -b 173.194.79.104 -F
This will start a varnish instance on port 10200 that uses Google as a back-end. Then:
varnishncsa
And in another shell:
curl -I --user-agent QQQQ$'\t'ZZZZ http://127.0.0.1:10200
(You can also add a tab to the command line by typing Ctrl-V + TAB.)
The output of varnishncsa is:
127.0.0.1 - - [26/Jan/2013:18:12:28 -0800] "HEAD http://127.0.0.1:10200/ HTTP/1.1" 200 0 "-" "QQQQ ZZZZ"
So the tab is not escaped.
According to RFC 2616 (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2), tabs are permitted in HTTP headers. In particular, a leading tab can be used to construct multi-line header strings. I have no idea how common this is (I suspect it's pretty rare), but who knows.
I don't think means you have to ditch tabs -- I doubt there's a problem-free delimiter. But you should write code and configure software with the expectation that literal tabs will be encountered so that you can deal with it gracefully.
-- Ori Livneh
Analytics mailing list Analytics@lists.wikimedia.org (mailto:Analytics@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org (mailto:Analytics@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/analytics
On Mar 20, 2013, at 7:01 PM, Ori Livneh ori@wikimedia.org wrote:
Hey Mark,
See my e-mail below and Erik Zachte's reply. I don't mind switching to tabs (I'm just making the change to the EventLogging config right now..) but I think we still need to escape whatever character we choose as delimiter in HTTP headers
Thanks.
As for escaping field separators... The logging has been changed to use tabs for some time already, and we've never been escaping tabs, just spaces. So I'm not changing /that/ part right now, just disabling the escaping of spaces as well, as there doesn't seem to be any use anymore.
Forwarded message:
From: Erik Zachte ezachte@wikimedia.org
We had major data loss on two occasions because of misaligned fields. Broken records just fell through and went unnoticed, causing up to some 20% data loss for months. There was embarrassment all over. It took many days to repair the data as much as possible. So if we fix this, let us make it really robust, without causing too much overhead.
If we bet tabs will never be introduced in the input stream, even where they are legal, and we take that approach on several issues, one of those unlikely events will surely happen sooner than you think (these are not random mutations, somewhere someone follows the same logic of making another layer more robust, call it evolutionary convergence).
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
Scanning an input record for tabs and raising a counter is also very efficient. Sending one alert hourly based on this counter should make us aware soon enough when this issue needs follow-up, yet without causing bottle necks.
Replacing tabs by a single other character is a lot better efficiency wise than escaping indeed. Because it doesn't change the length of the string, thus doesn't need a new memory allocation and string copy. Still, scanning a string is not completely trivial overhead (Varnish itself for example tries to avoid it at all cost), and I would prefer to avoid it if we can. Varnishncsa uses a significant amount of CPU right now, and since we're reducing the number of Varnish servers and increasing the amount of requests per box, I'd like to make it as efficient as we can.
I like the idea of monitoring the situation, and acting on it when needed. Since a log record should always contain a fixed number of tabs (field separators), anything wrong should be very easy to detect on the log processing side I think.
Thanks for thinking along,
I like the idea of monitoring the situation, and acting on it when needed. Since a log record should always contain a fixed number of tabs (field separators), anything wrong should be very easy to detect on the log processing side I think.
I agree, we could just spit out everything that doesn't have the expected number of tabs to a separate file.
On Thursday, March 21, 2013 at 4:12 AM, Mark Bergsma wrote:
I like the idea of monitoring the situation, and acting on it when needed. Since a log record should always contain a fixed number of tabs (field separators), anything wrong should be very easy to detect on the log processing side I think.
Yup, so we can be prepared for things like this: http://lists.wikimedia.org/pipermail/wikitech-l/2010-February/046788.html
:D
-- Ori Livneh
On 03/21/2013 04:33 PM, Ori Livneh wrote:
On Thursday, March 21, 2013 at 4:12 AM, Mark Bergsma wrote:
I like the idea of monitoring the situation, and acting on it when needed. Since a log record should always contain a fixed number of tabs (field separators), anything wrong should be very easy to detect on the log processing side I think.
Yup, so we can be prepared for things like this: http://lists.wikimedia.org/pipermail/wikitech-l/2010-February/046788.html
That's awesome. I am tempted to make a version with half spaces half tabs.
Matt Flaschen