On 7/6/06, Abigail Brady <morwen(a)evilmagic.org> wrote:
> And then the way to stop this is to abstract the logs for the traffic
> we want, and throw the raw logs away as quickly as possible.
> Something on the level of data that Google Trends can provide to
> the public is basically the type of thing we'd want to have (broad
> numbers on only the most popular search terms/pages).
Yep. If the biggest concerns are disk space and privacy, then the
answer is obviously to collect logs for short periods of time that
look vaguely like:
[[George W Bush]] 130.158.1.4 1/4/2006 12:00
[[Bill Clinton]] 130.158.1.4 1/4/2006 12:01
[[George W Bush]] 200.0.0.4 1/4/2006 12:01
then, every few hours or even minutes, reprocess them into this sort of format:
[[George W Bush]] 2 1/4/2006
[[Bill Clinton]] 1 1/4/2006
and discard the original log files. That means less disk space (entries
that receive fewer than N hits could even be dropped from the aggregate
log altogether) and no privacy concerns.
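The reprocessing step could be sketched roughly like this (a minimal
Python illustration, assuming raw log lines in the hypothetical format
shown above; the line format, the aggregate function, and the N-hit
threshold are all assumptions for the example, not an actual
implementation):

```python
import re
from collections import Counter

# Hypothetical raw log lines: [[Title]] IP date time, as in the example above.
RAW_LOGS = [
    "[[George W Bush]] 130.158.1.4 1/4/2006 12:00",
    "[[Bill Clinton]] 130.158.1.4 1/4/2006 12:01",
    "[[George W Bush]] 200.0.0.4 1/4/2006 12:01",
]

# Capture the page title and the date; the IP and time are matched but thrown away.
LINE_RE = re.compile(r"^\[\[(?P<title>[^\]]+)\]\] \S+ (?P<date>\S+) \S+$")

def aggregate(lines, min_hits=1):
    """Collapse raw entries into (title, date) -> hit count.

    IP addresses never reach the output, and entries with fewer than
    min_hits hits are discarded altogether, as suggested above.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m.group("title"), m.group("date"))] += 1
    return {k: v for k, v in counts.items() if v >= min_hits}

print(aggregate(RAW_LOGS))
# → {('George W Bush', '1/4/2006'): 2, ('Bill Clinton', '1/4/2006'): 1}
```

After a pass like this runs, the raw files can be deleted; only the
per-page daily counts survive, so there is nothing identifying left to
leak.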
I understand if there's no one to actually implement this at the moment though.
Steve