So far nobody has responded to my inquiry on whether they would be affected
by this chance. So please let us know if you are consuming a server log and
you are expecting spaces as delimiters. We want to make sure that we are
aware of all the people that will be affected by this.
Best,
Diederik
On Thu, May 10, 2012 at 9:51 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
Do we really
need to write code? Don't all of these servers have a
directive for setting a
custom log format? (nginx:
http://wiki.nginx.org/HttpLogModule#log_format)
Yeah, that's what we want to do. Robla is worried that changing the log
format using tabs will break things for too many people. Gotta find out
how many and who.
On May 9, 2012, at 6:23 PM, David Schoonover wrote:
Do we really need to write code? Don't all of
these servers have a
directive for setting a custom log format? (nginx:
http://wiki.nginx.org/HttpLogModule#log_format)
Relatedly, did you happen to try testing other whitespace characters? I
suspect
some subset of tabs \t, vertical tabs \v, and/or the line break
family \r\n\f get escaped consistently. (The stream modifiers \b\c\h\w are
illegal.) If we were to figure that out, we could just switch to that
character as the field delimiter and be done with it.
ps. It turns out it's legal to split header-values across newlines so
long as
the first character following the CRLF is a space or tab. So we
might want to deal with that. Just sayin'. (It's called Linear White Space
-- LWS.)
Some potentially useful HTTP RFC links:
- LWS:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
- Modified BNF definition:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
- The Content-Type definition, which talks about
"extended" attributes
in values, separated by semi-colons:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7
- Header definitions:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14
pps. For what it's worth, the C source to nginx's HTTP header parser is
available:
- ngx_http_parse_header_line:
http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_ht…
- ngx_http_parse_multi_header_lines:
http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_ht…
--
David Schoonover
dsc(a)wikimedia.org
On May 9, 2012, at 1:40 PM, Andrew Otto wrote:
> (Moving this to analytics list, cause uhhh, why not?)
>
> Ok some more info!
>
> I was testing some of my changes to udp-filter on my test labs
instance, and
hey! whadyaknow! The test failed. One of the failures was
due to the content type field not matching correctly. It looked like this:
> text/html; charset=UTF-8
> And actually, my test didn't catch the charset portion of this, my
regexp
is matching via whitespace. It was the semi-colon in the text/html
string that caused the test to fail.
>
> I started sleuthing, and determined that this only shows up in my logs
if I am
hitting mediawiki pages. I grepped mediawiki core, and found that
there are tons of places that are manually setting the Content-Type header
with the charset.
>
> So I'm pretty sure the setting of the value of this header is coming
directly from Mediawiki. But, that isn't really our issue. Headers are
allowed to have spaces in the values.
>
>
> Here's some example log output with spaces in headers.
>
https://gist.github.com/2648312
>
>
> Summary:
> Nginx:
> - Does not escape spaces in Content-Type
> - Does not escape spaces in Accept-Language
> - Does not escape spaces in any header (afaict*).
>
> Varnish:
> - Does not escape space in Content-Type
> + Escapes spaces in Accept-Langage
> + Escapes spaces in any header (afaict).
>
> Squid:
> ~ Removes charset from Content-Type header (?)
> + Escapes spaces in Accept-Language.
> + Escapes spaces in any header (afaict).
>
> The remaining question we need to answer: Do we want to patch source
code for
these guys in order to fix this problem? I'd rather not if we
don't have to. Do we have to? Uhhh, dunno!
>
>
> -Ao
>
>
> *As Far As I Can Tell
>
>
> On May 9, 2012, at 11:42 AM, Erik Zachte wrote:
>
>> For scripts it makes no difference: split("\t") is fine
>>
>> For browsing lines in an editor: I can disable tab expansion (or pipe
to
sed first)
>>
>> For human readability of pasted log lines in e.g. mail, it's no big
deal (just slightly more ambiguous than a printable char)
>>
>> So, I'm fine which whatever works best, inter-process wise.
>>
>> Downward compatibility is another issue, but subscribers to our logs
could
get tabs replaced by blanks before data are forwarded.
>> Anyway please don't change this
overnight.
>>
>> --
>>
>> *KISS*
>>
>> Then again. Diederiks filter could easily and cheaply change all
spaces to
underscores, then all tabs to spaces, and no-one would even have
to think twice, or make any changes.
>>
>> Erik
>>
>>
>>
>> From: Diederik van Liere [mailto:dvanliere@wikimedia.org]
>> Sent: Wednesday, May 09, 2012 3:59 PM
>> To: Andrew Otto
>> Cc: Erik Zachte; Rob Lanphier
>> Subject: Re: Using tab as delimiter instead of space in the log files
>>
>> Even I thought 12 months ago, "how hard can it be to use spaces only
as delimiters" in a server log line. And 12 months later I now *know* it is
extremely hard and those spaces will keep coming back, and we will keep
discarding those log lines because they do not match the expected number of
fields. Right now, these space issues are giving me headaches to come up
with reasonable WikipediaMobile app usage for the Mobile team.
>>
>> But think a couple of months ahead when we want to stream our data
into
the cluster, do we really want a gazillion if else if statements to
make sure that the data is in the right format before we store in the
cluster? This will create a huge performance problem.
>>
>> By keep fixing spaces we will always behind the facts, because it
takes a
while before we discover the issue, and then we need to make
changes to the config and get it approved or we fix it for once and for all.
>>
>> Because the user agent info is freeform I think it's safest to go with
a tab control character but I am open for other suggestions.
>>
>> Best,
>> D
>>
>> On Wed, May 9, 2012 at 9:40 AM, Andrew Otto <otto(a)wikimedia.org>
wrote:
>>> we have a lot of software scattered
all over the place processing
>>> this stuff (much of which is outside the purview of this group), and
>>> we're probably going to break at least some of it with this change.
>> If so ja, could be a problem. How widespread is this? We'd like to
keep track of these uses going forward.
>>
>>>>> Spaces, tabs and any
>>>>> non-URL character in a URL needs to be escaped by
>>>>> Varnish/nginx/Squid/whatever
>>>> the source can be located within hours and
>>>> patched.
>>> I really don't think we should start patching server software (like
we did with nginx user agent string) to replace spaces.
>> Diederik, I agree. If we can figure out
how to always escape things
properly in the log format config directives for each
log source
(nginx,squid,varnishncsa), then no problem. But patching and recompiling
every time we find a new edge case like this would be really annoying and
kinda dangerous. The more changes we make to upstream sources the more we
have to remember to maintain and the more bug prone we make crucial
software.
>>
>> Hmm, if we can figure out how to patch each of theses sources only
once to
make all whitespace characters in log fields escaped, then that
might be ok. Mayyyybe, I dunno. What if we add a new source? Then we'd
have to figure out how to patch it too.
>>
>>>> is there any printable
>>>> char that could be used?
>> Technically I'm sure we could find one, but I think we'd have more
complaints about readability in emails and editors if we don't use
whitespace than if we just used a tab.
>>
>> What if we used multiple spaces (2? 4? 5?) as a delimiter? I'm not
sure all the log format config directives would allow us to, but this might
be less prone to editor weirdness, and also might cause problems for fewer
users. Then they can split(' '). Although, hm, I guess that isn't much
better than split("\t"), eh?
>>
>> -A
>>
>>
>>
>> On May 9, 2012, at 9:21 AM, Diederik van Liere wrote:
>>
>>>
>>> The tab would be inserted by varnish/squid/nginx when writing the
log,
we don't need to worry about URL transport or anything like that.
>>> We keep running into new cases where
spaces show up where we don't
expect them, and I really don't think we
should start patching server
software (like we did with nginx user agent string) to replace spaces.
>>> That locks us into specific versions
of the software and the
alternative is easy and robust.
>>>
>>> Erik: I don't think that editor settings are a convincing argument
against the use of tabs.
>>>
>>> I am going to send this proposal to Wikitech mailing list.
>>>
>>>
>>> D
>>> On 2012-05-09, at 7:29 AM, Erik Zachte wrote:
>>>
>>>> I'm not strongly pro or con but here are some thoughts:
>>>>
>>>> I always get confused by tabs as they will be presented differently
>>>> depending on editor settings (and different editors have different
>>>> defaults).
>>>> If someone pastes some log lines in an email, tabs might already
have been
>>>> expanded.
>>>>
>>>> Also Diederik's new filter will validate the line, and count fields.
>>>> If it signals invalid lines, the source can be located within hours
and
>>>> patched.
>>>> How many sources could there be? We have 3 streams?
>>>> In theory I am in favor of validating at the earliest point.
>>>>
>>>> --
>>>>
>>>> I know this is tricky, but if we decide to change, is there any
printable
>>>> char that could be used?
>>>>
>>>> There are some characters that are out-of-bounds for urls, but some
of those
>>>> are unsafe for other reasons
>>>>
http://www.ietf.org/rfc/rfc1738.txt 2.2 section par. 'unsafe:'
>>>>
>>>> <quote>
>>>> Other characters are unsafe because gateways and other transport
agents are
>>>> known to sometimes modify such
characters.
>>>> These characters are "{", "}", "|",
"\", "^", "~", "[", "]", and
"`".
>>>> <unquote>
>>>>
>>>> One that comes to mind, but may have other quirks is 127/7F/
>>>>
>>>> Another 2 cents.
>>>> Erik
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Diederik van Liere [mailto:dvanliere@wikimedia.org]
>>>> Sent: Wednesday, May 09, 2012 3:33 AM
>>>> To: Rob Lanphier
>>>> Cc: Andrew Otto; Erik Zachte
>>>> Subject: Re: Using tab as delimiter instead of space in the log files
>>>>
>>>> a) Not sure why this wouldn't solve the problem, the problem is that
we
>>>> consistently are encountering
spaces in fields while we are also
using space
>>>> as delimiter. By changing the
delimiter we are no longer affected by
spaces
>>>> in the fields. Varnish does not
need to escape tabs, those tabs are
>>>> configured in the varnish/squid/nginx config files. Andrew, can you
confirm
>>>> this?
>>>>
>>>> b) Who else is consuming the log files besides the analytics team
and Erik
>>>> Zachte? And changing to a tab as
delimiter is going to be a small
change, so
>>>> yes it needs to be communicated
to the people who consume these log
files
>>>> but not fixing it or alternative
writing all kinds of patches seems
a worse
>>>> solution.
>>>>
>>>>
>>>> My 2 cents.
>>>> D
>>>>
>>>> On 2012-05-08, at 9:25 PM, Rob Lanphier wrote:
>>>>
>>>>> On Tue, May 8, 2012 at 6:06 PM, Diederik van Liere
>>>>> <dvanliere(a)wikimedia.org> wrote:
>>>>>> Instead of continuing patching our servers to fix these spaces
issues I
>>>> strongly suggest that we move
away from the space as delimiter and
start
>>>> using the tab (\t) character.
>>>>>
>>>>> Eeek! I'm pretty sure that this will:
>>>>> a) not really solve the problem. If the Varnish isn't escaping
>>>>> spaces, what makes us think it's escaping tabs? Spaces, tabs and
any
>>>>> non-URL character in a URL
needs to be escaped by
>>>>> Varnish/nginx/Squid/whatever
>>>>> b) we have a lot of software scattered all over the place
processing
>>>>> this stuff (much of which is
outside the purview of this group), and
>>>>> we're probably going to break at least some of it with this
change.
>>>>>
>>>>> Regardless, we'll need to have a much wider conversation about
such
a
>>> change than this group.
>>>
>>> Rob
>>
>>
>
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics