Tim Starling wrote:
* Provides a TCP query interface
I'd share
the in-memory hashtable between processes, and simply add a
'reader' process. We can live with race conditions, too.
There's only one log stream so I would think there would be only one
process. The task is log analysis, not log collection.
Nope, I was thinking of one writer process, and several 'reader'
processes attaching (read-only) to the mapped data.
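That split is easy to prototype with POSIX shared memory. A minimal
sketch, assuming a fixed-size counter table (the segment name, slot
layout and table size here are illustrative, not from any actual tool);
readers attach read-only, could serve the TCP query interface from the
same memory, and tolerate slightly stale or torn counters, which is the
"live with race conditions" part:

  /* Sketch: one writer updates a shared counter table; readers
     attach read-only.  Error handling omitted for brevity. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define NSLOTS (1 << 20)           /* hypothetical table size */

  struct slot { char key[64]; uint64_t hits; };

  /* Writer: create the segment and map it read-write.  The main
     loop would hash each log line to a slot and do t[i].hits++;
     there is no locking, so readers may occasionally observe a
     torn or stale count. */
  static struct slot *attach_rw(void)
  {
      int fd = shm_open("/hitcounts", O_CREAT | O_RDWR, 0644);
      ftruncate(fd, sizeof(struct slot) * NSLOTS);
      return mmap(NULL, sizeof(struct slot) * NSLOTS,
                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  }

  /* Reader: attach read-only; slightly lagging values are fine. */
  static struct slot *attach_ro(void)
  {
      int fd = shm_open("/hitcounts", O_RDONLY, 0);
      return mmap(NULL, sizeof(struct slot) * NSLOTS,
                  PROT_READ, MAP_SHARED, fd, 0);
  }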
* Does all that for 30k req/s using less than 10% CPU
and 2GB memory
You mean 10% of the *cluster's CPU*, don't you? ;)
10% of one processor. Maybe we could relax it if that proves to be
impossible, but there will only be one log host for now, and there might
be lots of log analysis tools, so we don't want any CPU hogs. Think C++,
not perl.
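As a sanity check on that budget (my arithmetic, assuming the figures
above): 10% of one CPU at 30k req/s is 0.1 s of CPU time per second
spread over 30,000 requests, i.e. about 3.3 microseconds per logged
line, on the order of a few thousand cycles at GHz clock rates. That
rules out anything interpreted or allocation-heavy in the per-line path.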
I was thinking of pure C...
Impossible?
We could start by profiling. Read
data for 5 minutes, compute it for 25. That 1/6 duty cycle would
cut the average rate to 5k req/s.
I could make a log snippet available for optimisation purposes, but
ultimately, it will have to work on the full stream. Sampling would give
you an unacceptable noise floor for the majority of those 25 million
articles.
Making the program seems fairly easy. But ensuring it will cope with all
that data is a bit scary.
(to nobody in particular) Thinking about the pipe
buffer issues we had the
other day, it might make sense to recompile the kernel on henbane to have
a larger pipe buffer, to cut down on context switches. At 30k req/s, it
would fill every 1.2ms.
-- Tim Starling
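(The 1.2 ms figure is consistent with the then-default 4 KB pipe buffer
and log lines of roughly 115 bytes: 30,000 req/s * 115 B is about
3.4 MB/s, and 4096 B / 3.4 MB/s is about 1.2 ms. A larger buffer
stretches the interval proportionally: 64 KB would fill every ~19 ms.
The line-size figure is an assumption, not from the thread.)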
The reading process should probably have higher priority than the
collector. Losing UDP packets is better than filling the pipe (/me
wonders what happens when it gets full. Write failed? SIGPIPE? Nuclear
meltdown?).
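For reference, none of the above: a blocking write(2) to a full pipe
simply blocks until the reader drains it, and SIGPIPE/EPIPE only occur
once the read end has been closed. A collector that prefers dropping to
stalling can open the pipe with O_NONBLOCK and count EAGAIN as a
dropped line. A minimal sketch (the FIFO path and the loop around it
are illustrative assumptions):

  /* Sketch: collector that drops lines instead of blocking when
     the pipe is full.  The FIFO path is a placeholder. */
  #include <errno.h>
  #include <fcntl.h>
  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      signal(SIGPIPE, SIG_IGN);      /* get EPIPE instead of dying */
      int fd = open("/tmp/logpipe", O_WRONLY | O_NONBLOCK);
      if (fd < 0) { perror("open"); return 1; }

      unsigned long dropped = 0;
      char line[256];                /* under PIPE_BUF: writes are atomic */
      while (fgets(line, sizeof line, stdin)) {
          if (write(fd, line, strlen(line)) < 0) {
              if (errno == EAGAIN)
                  dropped++;         /* pipe full: drop, don't stall */
              else if (errno == EPIPE)
                  break;             /* reader closed its end */
          }
      }
      fprintf(stderr, "dropped %lu lines\n", dropped);
      return 0;
  }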