On Mar 20, 2013, at 7:01 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
See my e-mail below and Erik Zachte's reply. I don't mind switching to tabs
(I'm just making the change to the EventLogging config right now..) but I think we
still need to escape whatever character we choose as delimiter in HTTP headers.
As for escaping field separators: the logging has been using tabs for some time
already, and we've never escaped tabs, just spaces. So I'm not changing
/that/ part right now, just disabling the escaping of spaces as well, as there doesn't
seem to be any use for it anymore.
> From: Erik Zachte <ezachte(a)wikimedia.org>
> We had major data loss on two occasions because of misaligned fields. Broken
> records just fell through and went unnoticed, causing up to some 20% data
> loss for months. There was embarrassment all over. It took many days to
> repair the data as much as possible. So if we fix this, let us make it
> really robust, without causing too much overhead.
> If we bet tabs will never be introduced in the input stream, even where they
> are legal, and we take that approach on several issues, one of those
> unlikely events will surely happen sooner than you think (these are not
> random mutations, somewhere someone follows the same logic of making another
> layer more robust, call it evolutionary convergence).
> The code to change existing tabs into some less obnoxious character is dead
> trivial, hardly any overhead. At worst one field will then be affected, not
> the whole record, which makes it easier to spot and debug the anomaly when
> it happens.
> Scanning an input record for tabs and raising a counter is also very
> efficient. Sending one alert hourly based on this counter should make us
> aware soon enough when this issue needs follow-up, yet without causing
> bottlenecks.
Replacing tabs with a single other character is indeed a lot better efficiency-wise than
escaping, because it doesn't change the length of the string and thus doesn't need a
new memory allocation and string copy. Still, scanning a string is not completely trivial
overhead (Varnish itself for example tries to avoid it at all cost), and I would prefer to
avoid it if we can. Varnishncsa uses a significant amount of CPU right now, and since
we're reducing the number of Varnish servers and increasing the amount of requests per
box, I'd like to make it as efficient as we can.
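
To make the same-length substitution concrete, here is a minimal Python sketch. It is illustrative only: the name `sanitize_field` and the choice of a space as the replacement character are assumptions, and this is not how varnishncsa (which is written in C) handles its buffers. The point is that the buffer is overwritten in place, so its length never changes, and the return value can feed the kind of counter Erik describes.

```python
def sanitize_field(buf: bytearray, replacement: int = ord(" ")) -> int:
    """Overwrite every tab in buf in place and return how many were replaced.

    Hypothetical helper: the buffer keeps its length, so no new
    allocation or copy is needed, but the scan itself still costs CPU.
    """
    replaced = 0
    pos = buf.find(b"\t")
    while pos != -1:
        buf[pos] = replacement          # same-length, in-place overwrite
        replaced += 1
        pos = buf.find(b"\t", pos + 1)  # continue scanning after this hit
    return replaced
```

Note that the scan over every field is exactly the overhead in question; the sketch only shows that the substitution itself adds no allocation on top of it.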
I like the idea of monitoring the situation, and acting on it when needed. Since a log
record should always contain a fixed number of tabs (field separators), anything wrong
should be very easy to detect on the log processing side, I think.
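
As a rough sketch of that check on the processing side (Python, purely illustrative): the names `is_well_formed` and `count_malformed` and the value of `EXPECTED_FIELDS` are assumptions, and the real number of fields depends on the log format in use.

```python
EXPECTED_FIELDS = 14  # hypothetical value; set to the real number of fields


def is_well_formed(line: str, expected_fields: int = EXPECTED_FIELDS) -> bool:
    """A record with N fields must contain exactly N - 1 tab separators."""
    return line.rstrip("\n").count("\t") == expected_fields - 1


def count_malformed(lines, expected_fields: int = EXPECTED_FIELDS) -> int:
    """Count records whose tab count is off, e.g. to drive a periodic alert."""
    return sum(1 for line in lines if not is_well_formed(line, expected_fields))
```

A stray tab inside a field shows up as one separator too many, so such records can be counted and set aside rather than silently misaligned.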
Thanks for thinking along,
Mark Bergsma <mark(a)wikimedia.org>
Lead Operations Architect