My suggestion for how to filter these bots efficiently in a C program (no costly, nuanced regexps) before sending data to webstatscollector:
a) Find the 14th field in the space-delimited log line = the user agent (but beware of false delimiters in logs from Varnish, if still applicable).
b) Search this field case-insensitively for bot/crawler/spider/http (by convention, only bots have a URL in the agent string). A sketch of both steps follows below.
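Something like the following is what I have in mind. It is an untested sketch: the helper names are mine, the field index and keyword list come from a) and b) above, and strcasestr is a GNU/BSD extension rather than standard C (hence the _GNU_SOURCE define).

#define _GNU_SOURCE /* for strcasestr (GNU extension; the BSDs ship it too) */
#include <stddef.h>
#include <string.h>

/* Return a pointer to the start of the nth (1-based) space-delimited
   field in line, or NULL if the line has fewer fields. */
static const char *nth_field(const char *line, int n)
{
    const char *p = line;
    for (int i = 1; i < n; i++) {
        p = strchr(p, ' ');
        if (p == NULL)
            return NULL;
        p++; /* step past the delimiter */
    }
    return p;
}

/* Case-insensitive check of the user-agent field (field 14) for the
   conventional bot markers. Note: this scans from the start of the
   field to the end of the line, which is good enough for a filter. */
static int looks_like_bot(const char *line)
{
    const char *ua = nth_field(line, 14);
    if (ua == NULL)
        return 0; /* malformed line: let it through */
    return strcasestr(ua, "bot")     != NULL ||
           strcasestr(ua, "crawler") != NULL ||
           strcasestr(ua, "spider")  != NULL ||
           strcasestr(ua, "http")    != NULL;
}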
That will filter out most of the bot pollution. We still want those records in the sampled log, though.
Any thoughts?
I did some research on fast string matching, and it seems that the recently developed algorithm by Leonid Volnitsky is very fast (http://volnitsky.com/project/str_search/index.html). I will run some benchmarks against the ordinary C strstr function, but the author claims it is 20x faster.
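For the benchmark I am thinking of a rough harness along these lines (my own sketch, not Volnitsky's code; his matcher would be dropped in next to the strstr call for the comparison, and the log line is made up):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    /* A made-up log line with a bot user agent. */
    const char *haystack =
        "1.2.3.4 - - [date] \"GET /wiki/Foo HTTP/1.1\" 200 1234 \"-\" "
        "\"Mozilla/5.0 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)\"";
    const char *needle = "spider"; /* worst case: no match, full scan */
    const long iterations = 10 * 1000 * 1000;
    volatile const char *hit = NULL; /* keep the loop from being optimized away */

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        hit = strstr(haystack, needle);
    clock_t end = clock();

    printf("strstr: %.2f s for %ld searches (%s)\n",
           (double)(end - start) / CLOCKS_PER_SEC, iterations,
           hit ? "found" : "not found");
    return 0;
}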
So instead of hard-coding where the bot information should be, just search the entire log line for it; if it is present, discard the line, and otherwise process it as-is.
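A minimal sketch of that whole-line filter, reading lines on stdin. One caveat: matched case-insensitively across the entire line, "http" would hit the "HTTP/1.1" in every request field, so only the other markers are scanned here (the URL marker would need narrowing to something like "http://" first).

#define _GNU_SOURCE /* for strcasestr */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Markers from the list above; "http" is left out because the
       request field ("HTTP/1.1") would match on every line. */
    static const char *markers[] = { "bot", "crawler", "spider" };
    char line[8192];

    while (fgets(line, sizeof line, stdin) != NULL) {
        int is_bot = 0;
        for (size_t i = 0; i < sizeof markers / sizeof markers[0]; i++) {
            if (strcasestr(line, markers[i]) != NULL) {
                is_bot = 1;
                break;
            }
        }
        if (!is_bot)
            fputs(line, stdout); /* process as-is */
    }
    return 0;
}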
Best,
Diederik