[Foundation-l] Release of squid log data

Gwern Branwen gwern0 at gmail.com
Sat Sep 15 04:53:10 UTC 2007


Gregory Maxwell <gmaxwell at gmail.com> scribbled:
> On 9/14/07, Ilya Haykinson <haykinson at gmail.com> wrote:
> > If we can find out the
> > reason they need IP addresses we can craft the data we send them to
> > satisfy their request.  For example:
>
> Two years ago*, when we didn't actually have the data to release, I
> proposed a two pronged approach, restated here:
>
> (1) Make as much of the non-private data public as we safely can, this
> maximizes the public value of this data and avoids the harm that
> picking favorites by sharing valuable data (commercially valuable as
> well as a academically valuable) with only certain groups. Plus it
> scales much better.
....

In a very strong sense, we can 'safely' make no data available. I went and did a little research (shucks, now I'm feeling like ArmedBlowfish). Entirely apart from obvious attacks using this data, like [[traffic analysis]] and all the various attacks Tor and remailer systems try to protect against, just the database alone is enough to compromise identities and reveal valuable information - even if you pseudonymize and remove data, and even if you insert dummy (but statistically valid, so it doesn't wreck analyses) data.
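To make that concrete, here is a toy sketch (Python; the pseudonyms, page titles, and the attacker's background knowledge are all invented for illustration, not anything Wikimedia actually logs or ships). The point is that if an attacker already knows even a couple of pages their target looked at - from an edit history, a blog post, an offhand remark - then the pseudonym whose requests best match that knowledge gives the target away, dummy rows or no:

    from collections import Counter

    # Hypothetical "scrubbed" squid log: (pseudonym, requested_page) pairs,
    # with IPs replaced by opaque tokens and dummy rows mixed in.
    scrubbed_log = [
        ("u1", "Main_Page"), ("u1", "Obscure_Village_X"), ("u1", "Rare_Disease_Y"),
        ("u2", "Main_Page"), ("u2", "Popular_Film_Z"),
        ("u3", "Main_Page"), ("u3", "Obscure_Village_X"),  # dummy / unrelated traffic
    ]

    # Background knowledge about the target: pages we believe they viewed.
    known_interests = {"Obscure_Village_X", "Rare_Disease_Y"}

    # Score each pseudonym by how many of the known pages it requested.
    scores = Counter()
    for pseudonym, page in scrubbed_log:
        if page in known_interests:
            scores[pseudonym] += 1

    # The best-matching pseudonym re-identifies the target; every other row
    # tagged with that pseudonym (times, referrers, reading habits) is now exposed.
    print(scores.most_common(1))  # [('u1', 2)]

Scale that up to millions of rows and richer background knowledge and the linkage only gets easier, not harder.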

The obvious example to prove this would be the leak of AOL search queries, but there's an even better example. It turns out that Iceland has a very large and very well-known national DNA database, with a great deal of metadata attached concerning family trees and the like (a somewhat amusing aside - a professor of mine once described her visits to Icelander-dominated parties; apparently when Icelanders have nothing better to chat about, or nothing particular in common, they simply go over their genealogies and figure out how they are related). Eventually [[Decode Genetics]]'s database was killed over privacy concerns (<http://observer.guardian.co.uk/international/story/0,6903,1217842,00.html> etc.).

This is interesting in its own right, but the relevant point for us is that efforts were made to anonymize and scrub the data before use, and those efforts failed - even though the techniques were more advanced than the ones I've seen suggested here. Inferences could be drawn from the data that broke the protection quite easily. I found one particularly interesting paper on the topic; I quote from the abstract:

 "Results: While susceptibility varies, we find that each of the protection methods studied is deficient in their protection against re-identification. In certain instances the protection schema itself, such as singly-encrypted pseudonymization, can be leveraged to compromise privacy even further than simple de-identification permits. In order to facilitate the future development of privacy protection methods, we provide a susceptibility comparison of the methods."

 "Conclusion: This work illustrates the danger of blindly adopting identity protection methods for genomic data. Future methods must account for inferences that can be leaked from the data itself and the environment into which the data is being released in order to provide guarantees of privacy. While the protection methods reviewed in this paper provide a base for future protection strategies, our analyses provide guideposts for the development of provable privacy protecting methods."

("Why Pseudonyms Don’t Anonymize: A Computational Re-identification Analysis of Genomic Data Privacy Protection Systems"; <http://privacy.cs.cmu.edu/dataprivacy/projects/linkage/lidap-wp19.pdf>.)

--
gwern
contacts Unix Force SUR Flame analysis bank Gamma CBNRC passwd

