We had major data loss on two occasions because of misaligned fields. Broken records just fell through and went unnoticed, causing up to roughly 20% data loss for months. There was embarrassment all over. It took many days to repair the data as far as that was possible. So if we fix this, let us make it really robust, without causing too much overhead.
If we bet that tabs will never be introduced in the input stream, even where they are legal, and we take that approach on several issues, one of those unlikely events will surely happen sooner than we think. These are not random mutations: somewhere, someone else is following the same logic of making another layer more robust. Call it evolutionary convergence.
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
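As a rough illustration of how small that code could be, here is a minimal Python sketch. The names `sanitize_field` and `format_record` and the choice of a space as the replacement character are assumptions for the example, not what the cache servers actually run.

```python
def sanitize_field(value, replacement=" "):
    """Replace literal tabs in one field so they cannot act as delimiters.

    The replacement character (a space) is an assumption; any character
    that is not the field delimiter would do.
    """
    return value.replace("\t", replacement)


def format_record(fields):
    """Join sanitized field values into one tab-delimited log line."""
    return "\t".join(sanitize_field(f) for f in fields)


# A user agent containing a literal tab, as in the varnishncsa test.
fields = ["127.0.0.1", "200", "QQQQ\tZZZZ"]
line = format_record(fields)

# The record still splits into exactly three fields; only the
# user-agent value itself was altered.
assert len(line.split("\t")) == len(fields)
```

Because each field is cleaned before the delimiter is added, a stray tab can damage at most the one value it appeared in.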
Scanning an input record for tabs and raising a counter is also very efficient. Sending one alert hourly based on this counter should make us aware soon enough when this issue needs follow-up, without causing bottlenecks.
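A sketch of such a counter, again in Python and under stated assumptions: the class name `TabAlertCounter`, the `alert` callback, and the injectable `clock` are all hypothetical, and the record passed in is assumed to be the raw input before any delimiters are added.

```python
import time


class TabAlertCounter:
    """Count raw records containing tabs; report at most once per interval."""

    def __init__(self, alert, interval_seconds=3600, clock=time.monotonic):
        self.alert = alert                # hypothetical callback, e.g. sends a mail
        self.interval = interval_seconds  # one hour by default
        self.clock = clock                # injectable so the logic can be tested
        self.count = 0
        self.window_start = clock()

    def observe(self, raw_record):
        """Scan one raw record and flush the counter if the interval elapsed."""
        if "\t" in raw_record:
            self.count += 1
        now = self.clock()
        if now - self.window_start >= self.interval:
            if self.count:
                self.alert("%d records contained tabs in the last interval" % self.count)
            self.count = 0
            self.window_start = now


# Usage: print stands in for a real alerting channel.
counter = TabAlertCounter(alert=print)
for record in ["GET /a", "GET /b\tX", "GET /c"]:
    counter.observe(record)
```

Note that in this sketch the check only runs when a record arrives, so an alert is sent with the first record after the interval has passed, and nothing is sent for an hour with zero affected records.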
Erik
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Ori Livneh
Sent: Sunday, January 27, 2013 8:14 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] RFC: Tab as field delimiter in logging format of cache servers
On Saturday, January 26, 2013 at 12:57 PM, Diederik van Liere wrote:
> Now there could be a tab in a header value as well but I have never seen it in our logfiles and I also grepped for it on a couple of random files and found no such occurrences. So we are not going to escape tab characters in fields unless new information changes our mind.
It would be nice to eliminate this worry categorically. I checked, and it appears that varnishncsa and varnishlog do not escape tabs.
How I tested:
varnishd -a :10200 -b 173.194.79.104 -F
This will start a varnish instance on port 10200 that uses Google as a back-end. Then:
varnishncsa
And in another shell:
curl -I --user-agent QQQQ$'\t'ZZZZ http://127.0.0.1:10200
(You can also add a tab to the command line by typing Ctrl-V + TAB.)
The output of varnishncsa is:
127.0.0.1 - - [26/Jan/2013:18:12:28 -0800] "HEAD http://127.0.0.1:10200/ HTTP/1.1" 200 0 "-" "QQQQ ZZZZ"
So the tab is not escaped.
According to RFC 2616 (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2), tabs are permitted in HTTP headers. In particular, a leading tab can be used to construct multi-line header strings. I have no idea how common this is (I suspect it's pretty rare), but who knows.
I don't think this means you have to ditch tabs -- I doubt there's a problem-free delimiter. But you should write code and configure software with the expectation that literal tabs will be encountered, so that you can deal with them gracefully.
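One way to deal with it gracefully on the consuming side is to check the field count of every line instead of trusting it. The following Python sketch is only an illustration: `split_records` and the expected field count are hypothetical, and the real log format has its own number of fields.

```python
def split_records(lines, expected_fields):
    """Separate well-formed tab-delimited lines from misaligned ones.

    Lines whose field count differs from expected_fields are kept aside
    rather than silently parsed into the wrong columns.
    """
    good, rejected = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == expected_fields:
            good.append(fields)
        else:
            rejected.append(line)
    return good, rejected


# Three fields are expected; the second line carries a stray tab
# inside its user-agent value and therefore splits into four.
lines = [
    "127.0.0.1\t200\tMozilla\n",
    "127.0.0.1\t200\tQQQQ\tZZZZ\n",
]
good, rejected = split_records(lines, expected_fields=3)
assert len(good) == 1 and len(rejected) == 1
```

Keeping the rejected lines, rather than dropping them, leaves something to inspect when the count of bad records starts to rise.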
-- Ori Livneh