My suggestion for how to filter these bots efficiently in a C program (no
costly, nuanced regexps) before sending data to webstatscollector:
a) Find the 14th field in the space-delimited log line = user agent (but
beware of false delimiters in logs from varnish, if still applicable)
b) Search this field case-insensitively for bot/crawler/spider/http (by
convention only bots have a URL in the agent string)
That will filter out most bot pollution. We still want those records in the
sampled log, though.
Any thoughts?
I did some research on fast string matching, and it seems that the
recently developed algorithm by Leonid Volnitsky is very fast
(http://volnitsky.com/project/str_search/index.html). I will do some
benchmarks against the ordinary C strstr function, but the author claims
it's 20x faster.
So instead of hard-coding where the bot information should be, we could
just search the entire log line for it: if it is present, discard the
line; otherwise process it as-is.
Best,
Diederik