On Mar 20, 2013, at 7:01 PM, Ori Livneh ori@wikimedia.org wrote:
Hey Mark,
See my e-mail below and Erik Zachte's reply. I don't mind switching to tabs (I'm making the change to the EventLogging config right now), but I think we still need to escape whatever character we choose as the delimiter when it appears in HTTP headers.
Thanks.
As for escaping field separators... The logging has been using tabs for some time already, and we've never been escaping tabs, just spaces. So I'm not changing /that/ part right now, just disabling the escaping of spaces as well, as there no longer seems to be any use for it.
Forwarded message:
From: Erik Zachte ezachte@wikimedia.org
We had major data loss on two occasions because of misaligned fields. Broken records just fell through and went unnoticed, causing up to about 20% data loss for months. There was embarrassment all over. It took many days to repair the data as much as possible. So if we fix this, let us make it really robust, without causing too much overhead.
If we bet that tabs will never be introduced in the input stream, even where they are legal, and we take that approach on several issues, one of those unlikely events will surely happen sooner than you think. These are not random mutations: somewhere, someone follows the same logic of making another layer more robust. Call it evolutionary convergence.
The code to change existing tabs into some less obnoxious character is dead trivial, hardly any overhead. At worst one field will then be affected, not the whole record, which makes it easier to spot and debug the anomaly when it happens.
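For illustration, a minimal sketch of that replacement in Python. The actual logging code is not shown in this thread, so the function names and the choice of a space as the replacement character are assumptions.

```python
def sanitize_field(value: str, replacement: str = " ") -> str:
    """Replace any tab inside a single field value.

    The replacement is one character, so the field keeps its length
    and can no longer be mistaken for a field separator.
    """
    return value.replace("\t", replacement)


def format_record(fields):
    """Join sanitized fields into one tab-separated log record."""
    return "\t".join(sanitize_field(field) for field in fields)


# Hypothetical record: the second field contains a stray tab.
record = format_record(["GET", "Mozilla/5.0\t(X11)", "200"])
```

With this in place a stray tab damages at most the one field it appeared in; the record still has the expected number of separators.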
Scanning an input record for tabs and raising a counter is also very efficient. Sending one alert hourly based on this counter should make us aware soon enough when this issue needs follow-up, yet without causing bottlenecks.
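A rough sketch of such a counter, again in Python and under assumptions: the class name, the alert callback, and the injectable clock are all hypothetical, and the check only runs when a record is observed.

```python
import time


class TabAlertCounter:
    """Count fields containing a tab and report at most once per interval."""

    def __init__(self, alert, interval_seconds=3600.0, clock=time.monotonic):
        self.alert = alert                # callable that receives the count
        self.interval = interval_seconds  # one hour by default
        self.clock = clock                # injectable so it can be tested
        self.count = 0
        self.window_start = clock()

    def observe(self, fields):
        """Scan the raw (unsanitized) fields of one record."""
        self.count += sum(1 for field in fields if "\t" in field)
        now = self.clock()
        if now - self.window_start >= self.interval:
            # Alert only if something was seen, then start a new window.
            if self.count:
                self.alert(self.count)
            self.count = 0
            self.window_start = now
```

Because the alert fires at most once per window, a burst of bad input produces one message per hour rather than one per record.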
Replacing tabs with a single other character is indeed a lot better efficiency-wise than escaping, because it doesn't change the length of the string and thus doesn't need a new memory allocation and string copy. Still, scanning a string is not completely trivial overhead (Varnish itself, for example, tries to avoid it at all costs), and I would prefer to avoid it if we can. Varnishncsa uses a significant amount of CPU right now, and since we're reducing the number of Varnish servers and increasing the number of requests per box, I'd like to make it as efficient as we can.
I like the idea of monitoring the situation and acting on it when needed. Since a log record should always contain a fixed number of tabs (field separators), anything wrong should be very easy to detect on the log processing side, I think.
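A minimal sketch of that check on the processing side, in Python. The function name is hypothetical and the expected field count is passed in by the caller, since the real log format is not given in this thread.

```python
def partition_records(lines, expected_fields):
    """Split log lines into well-formed records and rejects.

    A well-formed record has exactly expected_fields - 1 tab separators.
    Rejected lines are kept rather than dropped, so misaligned records
    cannot fall through unnoticed.
    """
    good, bad = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.count("\t") == expected_fields - 1:
            good.append(line.split("\t"))
        else:
            bad.append(line)
    return good, bad


# Hypothetical three-field format: the second and third lines are misaligned.
good, bad = partition_records(["a\tb\tc\n", "a\tb\n", "a\tb\tc\td\n"], 3)
```

The length of the reject list, relative to the total, gives a direct measure of how often the separator assumption is violated.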
Thanks for thinking along,