Gregory Maxwell <gmaxwell(a)gmail.com> scribbled:
On 9/14/07, Ilya Haykinson <haykinson(a)gmail.com> wrote:
If we can find out the reason they need IP addresses, we can craft the data we send
them to satisfy their request. For example:
Two years ago*, when we didn't actually have the data to release, I
proposed a two-pronged approach, restated here:
(1) Make as much of the non-private data public as we safely can; this
maximizes the public value of the data and avoids the harm of
picking favorites by sharing valuable data (commercially as well as
academically valuable) with only certain groups. Plus it
scales much better.
....
In a very strong sense, we can 'safely' make no data available. I went and did a
little research (shucks, now I'm feeling like ArmedBlowfish). Entirely apart from
obvious attacks using this data, like [[traffic analysis]] and all the various attacks Tor
and remailer systems try to protect against, the database alone is enough to
compromise identities and reveal valuable information, even if you pseudonymize and
remove data, and even if you insert dummy (but statistically valid, so it doesn't
wreck analyses) data.
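To make the pseudonymize-and-scrub failure concrete, here's a toy sketch of the classic linkage attack: if a "de-identified" release keeps quasi-identifiers like ZIP code, birth date, and sex, anyone holding public auxiliary data (say, a voter roll) with the same fields can simply join the two tables and recover identities. All the names, records, and field choices below are invented for illustration; they're not from any dataset discussed in this thread.

```python
# Toy linkage attack: re-identify "anonymized" records by joining on
# quasi-identifiers. All data here is invented for illustration.

# "Scrubbed" release: names removed, but quasi-identifiers kept.
released = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1972-01-02", "sex": "M", "diagnosis": "asthma"},
]

# Public auxiliary data (e.g. a voter roll) with the same fields plus names.
voter_roll = [
    {"name": "A. Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "B. Jones", "zip": "02139", "dob": "1972-01-02", "sex": "M"},
    {"name": "C. Doe",   "zip": "02139", "dob": "1980-05-05", "sex": "F"},
]

def reidentify(released, roll):
    """Join the tables on (zip, dob, sex); a unique match re-identifies the record."""
    hits = []
    for rec in released:
        key = (rec["zip"], rec["dob"], rec["sex"])
        matches = [v for v in roll if (v["zip"], v["dob"], v["sex"]) == key]
        if len(matches) == 1:  # unique quasi-identifier combination => identity leaks
            hits.append((matches[0]["name"], rec["diagnosis"]))
    return hits

print(reidentify(released, voter_roll))
# -> [('A. Smith', 'hypertension'), ('B. Jones', 'asthma')]
```

Dummy records only help if they make these quasi-identifier combinations non-unique, which is exactly what the "statistically valid" constraint fights against.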
The obvious example to prove this would be the leak of AOL search queries, but there's
an even better example. It turns out that Iceland has a very large and very well-known
national DNA database, associated with a large quantity of metadata about
family trees and whatnot (a somewhat amusing aside: a professor of mine once described
her visits to Icelander-dominated parties; apparently when Icelanders have nothing better
to chat about, or nothing particular in common, they simply go over their genealogies and
figure out how they are related). Eventually [[Decode Genetics]]'s database was shut down
over privacy concerns
(<http://observer.guardian.co.uk/international/story/0,6903,1217842,00.html> etc.).
This is interesting, yes, but for us the key point is that efforts were made to
anonymize and scrub the data before use. Even though the techniques were more
advanced than the ones I've seen suggested here, the efforts failed: inferences could
be made from the data that broke the security quite easily. I found one particularly
interesting paper on the topic; I quote from the abstract:
"Results: While susceptibility varies, we find that each of the protection methods
studied is deficient in their protection against re-identification. In certain instances
the protection schema itself, such as singly-encrypted pseudonymization, can be leveraged
to compromise privacy even further than simple de-identification permits. In order to
facilitate the future development of privacy protection methods, we provide a
susceptibility comparison of the methods."
"Conclusion: This work illustrates the danger of blindly adopting identity
protection methods for genomic data. Future methods must account for inferences that can
be leaked from the data itself and the environment into which the data is being released
in order to provide guarantees of privacy. While the protection methods reviewed in this
paper provide a base for future protection strategies, our analyses provide guideposts for
the development of provable privacy protecting methods."
("Why Pseudonyms Don’t Anonymize: A Computational Re-identification Analysis of
Genomic Data Privacy Protection Systems";
<http://privacy.cs.cmu.edu/dataprivacy/projects/linkage/lidap-wp19.pdf>.)
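The "singly-encrypted pseudonymization" failure mode the abstract mentions is easy to reproduce for the data this thread is actually about. The IPv4 space has only 2^32 values, so replacing each address with an unsalted deterministic hash is barely protection at all: an attacker just hashes candidate addresses until one matches. A minimal sketch (the hash choice, the `/24` shortcut, and the addresses are my own illustration, not anything from the paper):

```python
import hashlib

def pseudonymize(ip):
    """Deterministic, unsalted pseudonym -- the broken scheme."""
    return hashlib.sha256(ip.encode()).hexdigest()

# A "sanitized" log entry as it might be released.
released_pseudonym = pseudonymize("192.0.2.57")

def reverse(pseudonym, prefix="192.0.2."):
    """Attacker: enumerate candidate addresses and compare hashes.
    A real attack walks all 2**32 addresses (minutes on modern hardware);
    here we search a single /24 to keep the sketch short."""
    for last in range(256):
        candidate = prefix + str(last)
        if pseudonymize(candidate) == pseudonym:
            return candidate
    return None

print(reverse(released_pseudonym))
# -> 192.0.2.57
```

And because the pseudonym is deterministic, even before reversal it lets an attacker link every record belonging to the same address across the whole release, which is most of what traffic analysis needs.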
--
gwern
contacts Unix Force SUR Flame analysis bank Gamma CBNRC passwd