[Foundation-l] Release of squid log data

Wed Sep 19 22:45:50 UTC 2007

On 9/19/07, SlimVirgin <slimvirgin at gmail.com> wrote:
> On 9/19/07, Luna <lunasantin at gmail.com> wrote:
> > Yes, research is important. Yes, our goal is to spread and increase the sum
> > of human knowledge. But privacy of private data is currently written into
> > policy as being *important*, and I haven't yet seen a compelling reason to
> > change that.
> >
> > This should not be a casual decision. The information security of our
> > editors and readers should be an utmost priority.
> >
> Information security is particularly important given that
> cyberstalking has become an increasing problem on Wikipedia. We're
> currently hearing about several new cases a month, and some of them
> have been quite serious, with editors (usually admins) being contacted
> at their homes, family members threatened with violence, threats to
> contact employers, and so on.
>
> My understanding is that, with the information people are considering
> releasing, it would be possible for someone to work out which editor
> had which IP address, which would be a serious betrayal of trust.
>
> Sarah

Right; any relatively unique edit (one to a given article, without many
other edits close to it in time) could be traced from the HTTP
operations to the article edit logs, identifying the user involved.
Repeat for all the users who edit in a given time period... odds are
high that this could be used to effectively mass-checkuser the whole
site.  Given a database dump and the HTTP data stream, one could write
a tool to automatically resolve everything pretty easily.
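
To make the matching step concrete, here is a minimal Python sketch
using made-up records.  The record layouts, the link_edits name, and
the 5-second window are all assumptions for illustration; this is not
the actual squid log or database dump format.

```python
from datetime import datetime, timedelta

def link_edits(http_log, edit_log, window=timedelta(seconds=5)):
    """Map usernames to IPs by pairing each edit with the only HTTP
    request for the same article that falls inside the time window."""
    linked = {}
    for user, article, edit_time in edit_log:
        candidates = {
            ip
            for ip, req_article, req_time in http_log
            if req_article == article and abs(req_time - edit_time) <= window
        }
        # Only a temporally unique edit points at a single IP.
        if len(candidates) == 1:
            linked.setdefault(user, set()).update(candidates)
    return linked

t0 = datetime(2007, 9, 19, 22, 0, 0)
# Hypothetical HTTP log entries: (client IP, article, request time).
http_log = [
    ("192.0.2.10", "Squid", t0),
    ("198.51.100.7", "Proxy", t0 + timedelta(minutes=3)),
    ("203.0.113.5", "Proxy", t0 + timedelta(minutes=3, seconds=2)),
]
# Hypothetical public edit history: (username, article, edit time).
edit_log = [
    ("Alice", "Squid", t0 + timedelta(seconds=1)),
    ("Bob", "Proxy", t0 + timedelta(minutes=3, seconds=1)),
]

print(link_edits(http_log, edit_log))
```

In this toy data "Alice" resolves to a single IP, while "Bob" stays
ambiguous because two requests for the same article fall inside the
window.  That ambiguity is the only protection busy articles get.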

While I am generally all for editors being more open about their
identities, giving anyone the power to do this is a problem, in my
opinion.  We restrict this level of data access internally rather
strictly; allowing it out in the open, to independent researchers, is
potentially very problematic.  It worries me, and if it worries me, it
certainly will worry those who are more concerned with preserving
pseudonymity and privacy.  They would likely feel that this
is a breach of the explicit or implied privacy policy, and I would
tend to agree with them.

Even replacing IPs with unique hashes or other IDs would allow leakage
of info; one could extend the theoretical tool above to find all
temporally relatively unique edits by a given unique ID and look in
the database dump for any that were done by someone not logged in.
Those edits are recorded under the bare IP, so a single match unmasks
the ID.

Hashed IPs are also vulnerable to a brute force attack.  There are only
2^32 possible IPs; that's about 4 billion.  Even setting aside the
rather brutally obvious complete IP -> hash lookup table method, it
would take only a little over an hour to search the whole space if
your CPU can do a million hashes a second (2,000 usable OPS/hash or
so).  Anyone performing a widespread search would undoubtedly build
the table; it's going to be small (64 bit hash -> 32 GB) compared to
modern disks (and some people's RAM...).
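
The arithmetic is easy to check, and so is the attack itself on a
small range.  The sketch below is illustrative only: the choice of a
truncated SHA-256 as the "64 bit hash", the hash_ip and brute_force
names, and the /16 search range are my assumptions, not anything
proposed for the real data.

```python
import hashlib
import ipaddress

ADDRESS_SPACE = 2 ** 32          # every possible IPv4 address
HASHES_PER_SECOND = 1_000_000    # assumed rate from the estimate above
HASH_BYTES = 8                   # a 64-bit hash per table entry

search_minutes = ADDRESS_SPACE / HASHES_PER_SECOND / 60
table_gib = ADDRESS_SPACE * HASH_BYTES / 2 ** 30
print(f"full search: about {search_minutes:.0f} minutes")
print(f"lookup table: {table_gib:.0f} GiB")

def hash_ip(ip):
    """Unsalted 64-bit hash of an IP (truncated SHA-256, an assumption)."""
    return hashlib.sha256(str(ip).encode("ascii")).digest()[:HASH_BYTES]

def brute_force(target, network):
    """Return the address in `network` whose hash equals `target`, if any."""
    for ip in ipaddress.ip_network(network):
        if hash_ip(ip) == target:
            return str(ip)
    return None

# Search only a /16 (65,536 addresses) so the demo finishes quickly.
leaked = hash_ip("192.0.2.77")
print(brute_force(leaked, "192.0.0.0/16"))
```

The full search works out to roughly 72 minutes at that rate, and the
table to 32 GiB of hashes.  Nothing in the loop changes for the full
2^32 space except the running time.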

If you salted each IP with a different salt, that would be effective
against brute force, but would also require us to generate and store a
large secure table of salts (or, ip -> forward hash).  And it still
doesn't get around the temporally relatively unique edits comparison
method.
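
For what it's worth, the per-IP salt idea could look something like
the following sketch.  The SaltedPseudonymizer class, the 16-byte
salt, and the truncated SHA-256 output are hypothetical choices for
illustration, not a proposal for the actual implementation.

```python
import hashlib
import secrets

class SaltedPseudonymizer:
    """Hash each IP with its own random salt, kept in a private table."""

    def __init__(self):
        self._salts = {}  # ip -> salt; this table must stay secret

    def pseudonym(self, ip):
        # Reuse the stored salt so repeat visits map to the same ID.
        salt = self._salts.setdefault(ip, secrets.token_bytes(16))
        return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()[:16]

p = SaltedPseudonymizer()
a = p.pseudonym("192.0.2.10")
b = p.pseudonym("198.51.100.7")
print(a == p.pseudonym("192.0.2.10"))  # same IP gives the same ID
print(a != b)                          # different IPs give different IDs
```

Without the salt table an attacker can't rebuild the mapping by
hashing candidate IPs, but each ID is still stable across requests, so
the timing comparison works exactly as before.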

-- 
-george william herbert
george.herbert at gmail.com