Are there any reasons to not replace HTTP GET request IP addresses and proxy information with their SHA-512 secure hash prior to writing them to permanent media?
To expand a bit on Dan's answer: for analytics we need raw IPs to do geolocation, which is an important bit of information, but other than that we really do not need raw IPs for anything else thus far. It is not unheard of for us to have to redo our pageview processing due to bugs in the code or issues within the pipeline, so we need to have raw data available for a certain buffer time.
Now, data needed for ops is a different matter: having raw IPs is useful for troubleshooting connection problems, DoS and other issues. Normally, when ops troubleshoots issues with incoming traffic, IPs need to be available for some weeks but not months.
Data retention guidelines are documented here: https://meta.wikimedia.org/wiki/Data_retention_guidelines
On Thu, Nov 10, 2016 at 7:00 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
So, just to give context, our HTTP requests take this path:
- varnish log (very small buffer, not permanent)
- varnishkafka
- kafka (small buffer, I think 7 days)
- camus
- refine process (we use IPs at this point to geolocate)
- webrequest table on hdfs (this is the first time they're stored on permanent media, for 60 days)
- other datasets like hourly pageviews aggregates (IPs are not passed on to these)
So if we wanted to not store them in kafka buffers even, we'd have to give up geolocating. I think a lot of people find this very useful (fundraising, research, ops, reading), so it's unlikely to be removed.
I don't have as clear a reason for why we store the plain IP in webrequest. I think we could count uniques and all that other stuff with the IP hash. It's a good question, tentative +1 unless I'm forgetting something. But even so, it's not so bad, it's only stored for 60 days and we have no other plain IPs anywhere else (like we removed them from Event Logging for example).
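To make the idea concrete, here is a rough sketch of geolocating with the raw IP during refine and then keeping only its SHA-512 hash, which is still enough to count uniques. This is not the actual refine code: `geolocate`, `refine`, the record fields and the country codes are all hypothetical names made up for illustration, and the addresses come from the documentation-only ranges.

```python
import hashlib


def hash_ip(ip: str) -> str:
    """Return the hex SHA-512 digest of an IP address string."""
    return hashlib.sha512(ip.encode("utf-8")).hexdigest()


def geolocate(ip: str) -> str:
    """Hypothetical stand-in for a real geolocation lookup."""
    # Documentation-range prefixes mapped to made-up country codes.
    table = {"192.0.2.": "AA", "198.51.100.": "BB"}
    for prefix, country in table.items():
        if ip.startswith(prefix):
            return country
    return "unknown"


def refine(record: dict) -> dict:
    """Geolocate using the raw IP, then keep only its hash."""
    ip = record["ip"]
    out = {k: v for k, v in record.items() if k != "ip"}
    out["country"] = geolocate(ip)  # needs the raw IP
    out["ip_hash"] = hash_ip(ip)    # the raw IP is dropped after this point
    return out


# Assumed shape of incoming records; the real webrequest schema differs.
raw = [
    {"ip": "192.0.2.10", "uri": "/wiki/A"},
    {"ip": "192.0.2.10", "uri": "/wiki/B"},
    {"ip": "198.51.100.7", "uri": "/wiki/A"},
]

refined = [refine(r) for r in raw]

# Equal IPs hash to equal digests, so uniques can be counted on the hash.
unique_clients = len({r["ip_hash"] for r in refined})
```

The ordering is the point here: the geolocation lookup has to run before the raw address is discarded, which is why the hash could only replace the IP from the refine step onward and not in the kafka buffer.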
On Tue, Nov 8, 2016 at 4:26 PM, James Salsman jsalsman@gmail.com wrote:
Are there any reasons to not replace HTTP GET request IP addresses and proxy information with their SHA-512 secure hash prior to writing them to permanent media?
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics