So, just to give context, our HTTP requests take this path:

* varnish log (very small buffer, not permanent)

* varnishkafka

* kafka (small buffer, I think 7 days)

* camus

* refine process (we use IPs at this point to geolocate)

* webrequest table on hdfs (this is the first time they're stored on permanent media, for 60 days)

* other datasets, like hourly pageview aggregates (IPs are not passed on to these)

So if we wanted to keep IPs out of even the kafka buffers, we'd have to give up geolocating. I think a lot of people find geolocation very useful (fundraising, research, ops, reading), so it's unlikely to be removed.
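To make the ordering concrete, here's a rough sketch in Python of a refine-like step that geolocates from the plain IP first and only then swaps the IP for its SHA-512 hash. This is not our actual refine code: `geolocate`, `FAKE_GEO_DB`, and the record fields are all made up for illustration.

```python
import hashlib

# Hypothetical stand-in for a real geolocation database lookup.
FAKE_GEO_DB = {"198.51.100.7": "NL", "203.0.113.9": "US"}


def geolocate(ip):
    """Return a country code for a plain IP (illustrative lookup only)."""
    return FAKE_GEO_DB.get(ip, "Unknown")


def refine(record):
    """Geolocate from the plain IP, then keep only the IP's SHA-512 hash."""
    ip = record["ip"]
    # Copy every field except the plain IP.
    refined = {key: value for key, value in record.items() if key != "ip"}
    # Geolocation needs the plain IP, so it has to happen before hashing.
    refined["country"] = geolocate(ip)
    refined["ip_hash"] = hashlib.sha512(ip.encode("utf-8")).hexdigest()
    return refined


raw = {"ip": "198.51.100.7", "uri": "/wiki/Main_Page"}
refined = refine(raw)
```

The point is just that anything upstream of the geolocation call still has to see the plain IP, which is why hashing any earlier than the refine step would cost us geolocation.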

I don't have as clear a reason for why we store the plain IP in webrequest. I think we could count uniques and all that other stuff with the IP hash. It's a good question, tentative +1 unless I'm forgetting something. Even so, it's not so bad: the plain IP is only stored for 60 days, and we have no plain IPs anywhere else (for example, we removed them from Event Logging).
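As a quick illustration of why I think the hash would be enough for uniques, here's a minimal Python sketch. The request layout and the `hash_ip` and `count_unique_clients` helpers are hypothetical, not how webrequest is actually queried. It uses a bare SHA-512 only because that's what the question proposes.

```python
import hashlib


def hash_ip(ip):
    """Return the SHA-512 hex digest of a plain IP string."""
    return hashlib.sha512(ip.encode("utf-8")).hexdigest()


def count_unique_clients(hashed_requests):
    """Count distinct clients using only the stored IP hash."""
    return len({request["ip_hash"] for request in hashed_requests})


# Three requests from two distinct (documentation-range) addresses.
plain_ips = ["198.51.100.7", "203.0.113.9", "198.51.100.7"]
requests = [{"ip_hash": hash_ip(ip)} for ip in plain_ips]
unique_clients = count_unique_clients(requests)
```

Since the same IP always hashes to the same value, the distinct count comes out the same as it would on plain IPs.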

On Tue, Nov 8, 2016 at 4:26 PM, James Salsman <jsalsman@gmail.com> wrote:

Are there any reasons to not replace HTTP GET request IP addresses and
proxy information with their SHA-512 secure hash prior to writing them
to permanent media?

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics