Hey Mark,
See my e-mail below and Erik Zachte's reply. I don't mind switching to tabs
(I'm making the change to the EventLogging config right now), but I think we
still need to escape whatever character we choose as the delimiter in HTTP headers.
--
Ori Livneh
Forwarded message:
From: Erik Zachte <ezachte(a)wikimedia.org>
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest
in Wikipedia and analytics. <analytics(a)lists.wikimedia.org>
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in
Wikipedia and analytics. <analytics(a)lists.wikimedia.org>
Date: Sunday, January 27, 2013 5:07:50 AM
Subject: Re: [Analytics] RFC: Tab as field delimiter in logging format of cache servers
We had major data loss on two occasions because of misaligned fields. Broken
records just fell through and went unnoticed, causing up to some 20% data
loss for months. There was embarrassment all over. It took many days to
repair the data as much as possible. So if we fix this, let us make it
really robust, without causing too much overhead.
If we bet tabs will never be introduced in the input stream, even where they
are legal, and we take that approach on several issues, one of those
unlikely events will surely happen sooner than you think (these are not
random mutations, somewhere someone follows the same logic of making another
layer more robust, call it evolutionary convergence).
The code to change existing tabs into some less obnoxious character is dead
trivial, hardly any overhead. At worst one field will then be affected, not
the whole record, which makes it easier to spot and debug the anomaly when
it happens.
Scanning an input record for tabs and raising a counter is also very
efficient. Sending one alert hourly based on this counter should make us
aware soon enough when this issue needs follow-up, yet without causing
bottlenecks.
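
A minimal sketch of the two ideas above (replace tabs inside a field, and count how often that happens), written in Python. The function names, the counter key, and the choice of a space as the replacement character are assumptions for illustration; this is not code from the actual logging pipeline, and the hourly alert itself is left out.

```python
TAB = "\t"
REPLACEMENT = " "  # assumption: any character other than the delimiter would do


def sanitize_fields(fields, counter):
    """Replace literal tabs inside each field and count the affected fields."""
    clean = []
    for field in fields:
        if TAB in field:
            # Raise the counter so a periodic job can alert on it later.
            counter["tabs_seen"] = counter.get("tabs_seen", 0) + 1
            field = field.replace(TAB, REPLACEMENT)
        clean.append(field)
    return clean


def format_record(fields, counter):
    """Join sanitized fields with tabs, so the field count is preserved."""
    return TAB.join(sanitize_fields(fields, counter))
```

With this approach a stray tab damages at most the one field that contained it, and the counter gives a cheap signal to check once an hour.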
Erik
-----Original Message-----
From: analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Ori Livneh
Sent: Sunday, January 27, 2013 8:14 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] RFC: Tab as field delimiter in logging format of
cache servers
On Saturday, January 26, 2013 at 12:57 PM, Diederik van Liere wrote:
Now there could be a tab in a header value as well, but I have never seen
it in our logfiles and I also grepped for it on a couple of random files and
found no such occurrences. So we are not going to escape tab characters in
fields unless new information changes our mind.
It would be nice to eliminate this worry categorically. I checked, and it
appears that varnishncsa and varnishlog do not escape tabs.
How I tested:
varnishd -a :10200 -b 173.194.79.104 -F
This will start a varnish instance on port 10200 that uses Google as a
back-end. Then:
varnishncsa
And in another shell:
curl -I --user-agent QQQQ$'\t'ZZZZ http://127.0.0.1:10200
(You can also add a tab to the command line by typing Ctrl-V + TAB.)
The output of varnishncsa is:
127.0.0.1 - - [26/Jan/2013:18:12:28 -0800] "HEAD http://127.0.0.1:10200/ HTTP/1.1" 200 0 "-" "QQQQ ZZZZ"
So the tab is not escaped.
According to RFC 2616 (see
<http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2>), tabs are
permitted in HTTP headers. In particular, a leading tab can be used to
construct multi-line header strings. I have no idea how common this is (I
suspect it's pretty rare), but who knows.
I don't think this means you have to ditch tabs -- I doubt there's a problem-free
delimiter. But you should write code and configure software with the
expectation that literal tabs will be encountered so that you can deal with
it gracefully.
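
As a rough illustration of handling literal tabs gracefully on the consuming side, here is a small Python sketch that splits each line on tabs and sets aside any record whose field count is wrong instead of letting it fall through unnoticed. The function name and the `expected_fields` parameter are hypothetical; the real number of fields depends on the log format in use.

```python
def split_records(lines, expected_fields):
    """Split tab-delimited lines; keep malformed ones aside for inspection."""
    good, bad = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == expected_fields:
            good.append(fields)
        else:
            # A stray tab (or a truncated line) changes the field count.
            bad.append(line)
    return good, bad
```

Keeping the rejected lines, rather than dropping them, makes it possible to see how often misaligned records occur and to repair them later.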
--
Ori Livneh
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org (mailto:Analytics@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/analytics