On Thu, Jan 14, 2010 at 11:01 AM, David Gerard <dgerard(a)gmail.com> wrote:
2010/1/14 Bryan Tong Minh
<bryan.tongminh(a)gmail.com>:
On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
> * log search and SHA1 IP hash (anonymous!)
There are only 2 billion unique addresses and they can all be found in
half an hour probably.
A count of search terms, with no IP info at all? Would be more useful
than nothing.
(modulo the issue Michael Snow raised re: searches on suppressable names)
Magnus was not suggesting disclosing the IP hash, as far as I can
tell. He was demonstrating an abundance of caution in suggesting only
logging that. (Er, well, yeah, if he was suggesting disclosing that...
we shouldn't do that. Even if we add a secret to the hash, it's risky
and allows interesting correlation attacks)
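To illustrate why an unsalted hash of an IP is not anonymous, here is a
minimal sketch of the enumeration Bryan describes. It assumes the hash
is a plain SHA-1 over the dotted-quad string, which is only a guess at
what would be logged; sha1_ip and recover_ip are hypothetical names
made up for this example.

```python
import hashlib
import ipaddress


def sha1_ip(ip):
    # Assumption: the logged value is SHA-1 over the ASCII dotted-quad string.
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


def recover_ip(target_hash, network):
    # Hash every address in the network and compare against the target.
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if sha1_ip(candidate) == target_hash:
            return candidate
    return None  # not in this network


if __name__ == "__main__":
    leaked = sha1_ip("192.0.2.77")
    print(recover_ip(leaked, "192.0.2.0/24"))
```

The same loop over the whole IPv4 space is just a bigger range. A
secret added to the hash stops this particular enumeration, but as
noted above it still lets the same user's searches be correlated.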
Here is what I would suggest disclosing:
#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics
Which has first been filtered by:
* Canonicalization of strings (at least ASCII case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs
during the reporting interval
There will be useful information excluded by this process, e.g. gads
of misspellings which came from only two to four unique IPs... but the
output would still be *far* more useful than no information at all.
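
The three filters listed above can be sketched roughly as follows. This
is only an illustration under some assumptions: the input is taken to
be (ip, query) pairs from the raw log, the 64-character cutoff is an
arbitrary placeholder for "some length", whitespace collapsing is an
extra canonicalization step not mentioned above, and casefold() goes
somewhat beyond plain ASCII case folding. The names canonicalize and
aggregate are made up for the example.

```python
from collections import defaultdict

MAX_LEN = 64          # placeholder cutoff for "some length"
MIN_DISTINCT_IPS = 5  # threshold from the proposal


def canonicalize(query):
    # Collapse runs of whitespace (an assumption) and fold case.
    return " ".join(query.split()).casefold()


def aggregate(records, max_len=MAX_LEN, min_ips=MIN_DISTINCT_IPS):
    """records: iterable of (ip, query) pairs.

    Returns (hits, query) rows, most frequent first. IPs are used only
    to count distinct sources and never appear in the output.
    """
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, query in records:
        q = canonicalize(query)
        if not q or len(q) > max_len:
            continue  # drop empty and over-long strings
        hits[q] += 1
        ips[q].add(ip)
    rows = [(hits[q], q) for q in hits if len(ips[q]) >= min_ips]
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


if __name__ == "__main__":
    log = [("10.0.0.%d" % i, "Hot  Grits") for i in range(5)]
    log += [("10.0.0.1", "rare query")] * 10
    for count, query in aggregate(log):
        print(count, query)
```

Note that "rare query" is dropped even with ten hits, since all of them
came from one IP; the threshold counts distinct sources, not hits.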