On May 10, 2012, at 2:54 PM, Diederik van Liere wrote:

Guys, this is turning into a complete bike-shed discussion.

I suggest the following:
1) We move to the tab character as delimiter. This is not 100% accurate, but it will cause way, way fewer issues than space.
2) We will extensively test this in the Labs environment where we have nginx/varnish/squid running
3) We will notify all log consumers beforehand, with about two weeks' notice.
4) We will give Erik Zachte ample time to adjust and we will supply him with test data. The two weeks' notice starts as soon as Erik has given the thumbs up.
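
To illustrate point 1, here is a minimal sketch of why tab is the safer delimiter. The three-field record is made up for illustration and is not the real log format:

```python
# Hypothetical three-field log record; the field layout is an assumption
# for illustration, not the actual server log format.
fields = ["208.80.152.2", "text/html; charset=utf8", "http://wikimedia.org/"]

space_line = " ".join(fields)
tab_line = "\t".join(fields)

# Splitting on space breaks the content-type field in two,
# because that field legitimately contains a space.
space_fields = space_line.split(" ")

# Splitting on tab recovers the original fields, as long as no field
# contains a literal tab (the "not 100% accurate" caveat).
tab_fields = tab_line.split("\t")

print(len(fields), len(space_fields), len(tab_fields))
```

A consumer that splits on space sees one field too many here; a consumer that splits on tab sees the record as written.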

How does that sound?

Best,
Diederik

On Thu, May 10, 2012 at 2:50 PM, David Schoonover <dsc@wikimedia.org> wrote:

I don't think charsub is all that complex or scary. However, I think substituting in _ is a really bad idea.

** Once you do this, you cannot undo it, because _ is a valid character in all fields[1]. **

And for some data, there's a huge difference. It may be obvious that "text/html;_charset_=_utf8" is "text/html; charset = utf8", but in the case of a client sending a URL that isn't properly URL-encoded, the meaning of the request is totally changed if you convert "http://wikimedia.org/ " (which should have been encoded by the sending client to "http://wikimedia.org/%20") to "http://wikimedia.org/_". And there's no inverse function: you don't know if "http://wikimedia.org/_" was "http://wikimedia.org/ " or really, actually "http://wikimedia.org/_".
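
A minimal sketch of the problem, assuming the substitution is a plain space-to-underscore replacement (substitute_spaces is a hypothetical stand-in, not the actual filter code):

```python
def substitute_spaces(field: str) -> str:
    """Replace every space with an underscore (hypothetical stand-in
    for the character substitution being proposed)."""
    return field.replace(" ", "_")


unencoded = "http://wikimedia.org/ "   # client failed to encode the space
literal = "http://wikimedia.org/_"     # client really asked for "_"

# Two different requests collapse to the same log entry, so no function
# can map the log entry back to the original request.
collision = substitute_spaces(unencoded) == substitute_spaces(literal)
print(collision)
```

Since two distinct inputs produce the same output, the mapping loses information and cannot be reversed downstream.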

So the obvious next question is: why not escape them ourselves? Because now you need a string copy, as escaping isn't 1:1 in characters (" " becomes "\ " (or whatever), which is more than one character). This comes back to what I was asking before: is there any whitespace character that is escaped by all our log sources? (My suspicion is that either \r, \n, or \v is escaped by everyone.)
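
To make the length point concrete, here is a rough sketch. The backslash escaping scheme is only an assumption for illustration (the "or whatever" above), not something any of our log sources is known to do:

```python
field = "text/html; charset = utf8"

# 1:1 substitution: one character in, one character out, so in C this
# could be done in place on the existing buffer.
substituted = field.replace(" ", "_")

# Hypothetical escaping scheme: escape backslashes first so the result
# can be decoded unambiguously, then escape spaces. Every escaped
# character grows the string by one, so the output needs a new buffer.
escaped = field.replace("\\", "\\\\").replace(" ", "\\ ")

print(len(field), len(substituted), len(escaped))
```

The substituted string has the same length as the input, while the escaped string is longer by one character per space, which is what forces the copy.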

If we can't find a whitespace character, using a non-semantic control character (0x00-0x1F) should work, but it's riskier as some downstream consumers might choke on non-printable characters. Still: the best option here, hands down, is Bell (\a, 0x07). Most Unix programs understand it but don't do anything scary with it, and it doesn't change the meaning of the string. (We might make random machines beep. I am okay with this.) Additionally, it doesn't match '\s' in PCRE, which some people might be using to split the output.
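
A quick way to check the '\s' claim, sketched with Python's re module as a stand-in for PCRE (an assumption on my part; the two engines agree on Bell and Form Feed, but this is not PCRE itself). The two-field record is made up:

```python
import re

BELL = "\a"       # 0x07
FORM_FEED = "\f"  # 0x0C

# re.ASCII restricts \s to [ \t\n\r\f\v], the closest match to the
# classic regex whitespace class.
bell_is_space = re.fullmatch(r"\s", BELL, flags=re.ASCII) is not None
form_feed_is_space = re.fullmatch(r"\s", FORM_FEED, flags=re.ASCII) is not None

# Hypothetical two-field record joined with Bell as the delimiter.
fields = ["text/html; charset=utf8", "http://wikimedia.org/"]
line = BELL.join(fields)
recovered = line.split(BELL)

print(bell_is_space, form_feed_is_space, recovered == fields)
```

Bell is not treated as whitespace, so a consumer splitting on '\s' will not confuse it with the spaces inside a field; Form Feed is treated as whitespace.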

If for some reason Bell isn't acceptable, Form Feed (\f 0x0C) is probably our next-best option. Unfortunately, it matches '\s', and (heh) prints as six newlines. (But if you're printing our logs, god help you.) Using any of the rest is sketchy, though with some testing, the Device Control characters (0x11-0x14) might be okay.

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 -- "Many HTTP/1.1 header field values consist of words separated by LWS or special characters."

    CHAR       = <any US-ASCII character (octets 0-127)>
    CTL        = <any US-ASCII control character (octets 0-31 + 127)>
    token      = 1*<any CHAR except CTLs or separators>
    separators = "(" | ")" | "<" | ">" | "@"
               | "," | ";" | ":" | "\" | <">
               | "/" | "[" | "]" | "?" | "="
               | "{" | "}" | SP | HT

See also: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14

--
David Schoonover
dsc@wikimedia.org

On May 10, 2012, at 11:11 AM, Andrew Otto wrote:

> Well, he's still suggesting we switch to tab as delimiter in sources. Same solution, but with the extra bonus of allowing udp-filter to give our downstream consumers what they currently expect.
>
>
> On May 10, 2012, at 2:08 PM, Diederik van Liere wrote:
>
>> Hi Erik,
>>
>> Yes, it is backward compatible, but that does not outweigh the drawbacks. It's not simple, as it creates a disconnect between the configuration of the server log and the actual output. In addition, it is not a future-proof solution, because we also want to stream the server log data to the analytics cluster, and then we will still be stuck with the same problem (as streaming the data into the analytics cluster will not depend on the udp-filter software). We should apply a real solution, not a monkey patch.
>>
>> D
>>
>> On Thu, May 10, 2012 at 2:03 PM, Erik Zachte <ezachte@wikimedia.org> wrote:
>> There are more suggestions hanging in the air waiting to be shot down.
>>
>> Character replacement in C is very cheap.
>>
>> So why not feed Diederik's filter with tab-delimited data, and export space-delimited data?
>>
>> The filter first replaces all (non-delimiting) spaces with underscores, then replaces all (delimiting) tabs with spaces.
>>
>> Simple, and backward compatible.
>>
>> Erik
>>
>> From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Diederik van Liere
>> Sent: Thursday, May 10, 2012 3:57 PM
>> To: analytics@lists.wikimedia.org
>> Subject: Re: [Analytics] Using tab as delimiter instead of space in the log files
>>
>>
>>
>> So far nobody has responded to my inquiry on whether they would be affected by this change. So please let us know if you are consuming a server log and you are expecting spaces as delimiters. We want to make sure that we are aware of all the people that will be affected by this.
>>
>>
>>
>> Best,
>>
>> Diederik
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics