On Thu, Jan 14, 2010 at 11:01 AM, David Gerard dgerard@gmail.com wrote:
> 2010/1/14 Bryan Tong Minh bryan.tongminh@gmail.com:
>> On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske magnusmanske@googlemail.com wrote:
>>> - log search and SHA1 IP hash (anonymous!)
>> There are only 2 billion unique addresses and they can all be found in half an hour probably.
> A count of search terms, with no IP info at all? Would be more useful than nothing.
> (modulo the issue Michael Snow raised re: searches on suppressable names)
Magnus was not suggesting disclosing the IP hash, as far as I can tell. He was demonstrating an abundance of caution in suggesting only logging that. (er, well, yea, if he was suggesting disclosing that... we shouldn't do that. Even if we add a secret to the hash, it's risky and allows interesting correlation attacks)
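To make the point about unsalted hashes concrete, here is a rough sketch of the enumeration. It assumes the hash is plain SHA-1 over the dotted-quad string, which is only my guess at the scheme; sha1_ip and reverse_hash are made-up names, and it searches a single /24 rather than the whole address space:

```python
import hashlib
import ipaddress

def sha1_ip(ip):
    # Assumed scheme: unsalted SHA-1 over the dotted-quad text of the address.
    return hashlib.sha1(ip.encode("ascii")).hexdigest()

def reverse_hash(target_hex, network):
    # Walk every address in the given network and compare hashes.
    # Returns the matching address as a string, or None if none matches.
    for addr in ipaddress.ip_network(network):
        if sha1_ip(str(addr)) == target_hex:
            return str(addr)
    return None

# 192.0.2.0/24 is a documentation-only range, used here as a stand-in.
leaked = sha1_ip("192.0.2.77")
recovered = reverse_hash(leaked, "192.0.2.0/24")
```

The same loop over all of IPv4 is just a bigger range, which is why the hash alone buys no anonymity.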
Here is what I would suggest disclosing:
#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics
Which has first been filtered by:
* Canonicalization of strings (at least ascii case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs during the reporting interval
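The filtering steps above could be sketched roughly as follows. This is only an illustration: the (ip, search_string) input shape, the 64-character cutoff, and the names canonicalize and summarize are all assumptions of mine, and the IPs are used only in memory to count distinct sources, never written to the output:

```python
from collections import defaultdict

MAX_LEN = 64          # hypothetical length cutoff; the real value is undecided
MIN_DISTINCT_IPS = 5  # threshold from the list above

def canonicalize(raw):
    # Case folding via str.lower() (covers ASCII and more); trimming the
    # surrounding whitespace is an extra assumption, not part of the proposal.
    return raw.strip().lower()

def summarize(log, max_len=MAX_LEN, min_ips=MIN_DISTINCT_IPS):
    # log: iterable of (ip, search_string) pairs for one reporting interval.
    # Returns (hits, search_string) rows, most frequent first.
    hits = defaultdict(int)
    sources = defaultdict(set)
    for ip, raw in log:
        term = canonicalize(raw)
        if not term or len(term) > max_len:
            continue  # drop empty and overlong strings
        hits[term] += 1
        sources[term].add(ip)
    rows = [(n, term) for term, n in hits.items()
            if len(sources[term]) >= min_ips]
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows

sample = [("10.0.0.%d" % i, "Hot Grits") for i in range(5)]
sample += [("10.0.0.1", "hot grits")]
sample += [("10.0.0.%d" % i, "rare typo") for i in range(4)]
report = summarize(sample)
```

Here "rare typo" is dropped because it came from only four distinct IPs, while the two spellings of "hot grits" are merged into one row.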
There will be useful information excluded by this process, e.g. gads of misspellings which came from only two to four unique IPs... but the output would still be *far* more useful than no information at all.