On 7/6/06, Abigail Brady morwen@evilmagic.org wrote:
And then the way to stop this is to abstract the logs for the traffic we want, and throw the raw logs away as quickly as possible. Something at the level of data that Google Trends provides to the public is basically the type of thing we'd want to have (broad numbers on only the most popular search terms/pages).
Yep. If the biggest concerns are disk space and privacy, then the answer is obviously to collect logs for short periods of time that look vaguely like:
    [[George W Bush]] 130.158.1.4 1/4/2006:12:00
    [[Bill Clinton]] 130.158.1.4 1/4/2006:12:01
    [[George W Bush]] 200.0.0.4 1/4/2006:12:01
then every few hours or even minutes reprocess them into this sort of format:

    [[George W Bush]] 2 1/4/2006
    [[Bill Clinton]] 1 1/4/2006
and discard the original log files. Less disk space (entries that receive fewer than N hits could even be discarded altogether from the aggregate log) and no privacy concerns.
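To make the reprocessing step concrete, here is a rough Python sketch of what I have in mind. It is only an illustration: the line format, the names `LINE_RE`, `aggregate` and `format_aggregate`, and the `min_hits` parameter (the "N" above) are all my own assumptions, not anything that exists in the current setup. It counts hits per page per day, never copies the IP address into the output, and drops pages below the threshold.

```python
import re
from collections import Counter

# Assumed raw line format (hypothetical): [[Page title]] <ip> <date>:<HH:MM>
# A space between date and time is accepted as well.
LINE_RE = re.compile(
    r"^\[\[(?P<title>.+?)\]\]\s+(?P<ip>\S+)\s+"
    r"(?P<date>\d{1,2}/\d{1,2}/\d{4})[ :](?P<time>\d{1,2}:\d{2})$"
)

def aggregate(lines, min_hits=1):
    """Count hits per (title, date); the IP and time of day are discarded."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed format
        counts[(m["title"], m["date"])] += 1
    # Drop entries that received fewer than min_hits hits.
    return {key: n for key, n in counts.items() if n >= min_hits}

def format_aggregate(counts):
    """Render the aggregate as '[[Title]] <hits> <date>', most hits first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"[[{title}]] {n} {date}" for (title, date), n in ordered]

if __name__ == "__main__":
    raw = [
        "[[George W Bush]] 130.158.1.4 1/4/2006:12:00",
        "[[Bill Clinton]] 130.158.1.4 1/4/2006:12:01",
        "[[George W Bush]] 200.0.0.4 1/4/2006:12:01",
    ]
    for out_line in format_aggregate(aggregate(raw)):
        print(out_line)
```

Something would then run this every few hours (or minutes) over the current raw file, append the result to the aggregate log, and delete the raw file.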
I understand if there's no one to actually implement this at the moment though.
Steve