Are there any reasons to not replace HTTP GET request IP addresses and
proxy information with their SHA-512 secure hash prior to writing them
to permanent media?
To expand a bit on Dan's answer:
For analytics we need raw IPs to do geolocation, which is an important bit
of information, but other than that we have not needed raw IPs for
anything else thus far. It is not unheard of for us to have to redo our
pageview processing due to bugs in the code or issues within the pipeline,
so we need to have raw data available for a certain buffer time.
Data needed for ops is a different matter: having raw IPs is useful for
troubleshooting connection problems, DoS attacks, and similar issues.
Normally, the troubleshooting ops does on incoming traffic needs IPs to be
available for some weeks, but not months.
Data retention guidelines are documented here:
https://meta.wikimedia.org/wiki/Data_retention_guidelines
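To make the "weeks, not months" idea concrete, here is a minimal sketch of
scrubbing raw IPs from records once they age out of a retention window. The
record layout, the function name scrub_expired_ips, and the 60-day constant
are assumptions for illustration only (60 days is the webrequest figure
mentioned further down in this thread); this is not how the actual pipeline
is implemented.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window for raw IPs; illustrative, not an official value.
RAW_IP_RETENTION = timedelta(days=60)


def scrub_expired_ips(records, now=None):
    """Return copies of records, with the raw IP removed once a record
    is older than the retention window. Input records are not modified."""
    now = now or datetime.now(timezone.utc)
    scrubbed = []
    for record in records:
        if now - record["ts"] > RAW_IP_RETENTION:
            # Keep the rest of the record, drop only the raw IP.
            record = {**record, "ip": None}
        scrubbed.append(record)
    return scrubbed


# Hypothetical records using documentation-range IP addresses.
example_now = datetime(2016, 11, 10, tzinfo=timezone.utc)
example_records = [
    {"ts": datetime(2016, 8, 1, tzinfo=timezone.utc), "ip": "198.51.100.7"},
    {"ts": datetime(2016, 11, 1, tzinfo=timezone.utc), "ip": "203.0.113.9"},
]
example_scrubbed = scrub_expired_ips(example_records, now=example_now)
```

In this sketch the August record loses its IP while the November one keeps
it, which still leaves recent raw data available for reprocessing.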
On Thu, Nov 10, 2016 at 7:00 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
So, just to give context, our HTTP requests take this path:
* varnish log (very small buffer, not permanent)
* varnishkafka
* kafka (small buffer, I think 7 days)
* camus
* refine process (we use IPs at this point to geolocate)
* webrequest table on hdfs (this is the first time they're stored on
permanent media, for 60 days)
* other datasets, like hourly pageview aggregates (IPs are not passed on
to these)
So if we wanted to avoid storing them even in the kafka buffers, we'd have
to give up geolocating. I think a lot of people find geolocation very useful
(fundraising, research, ops, reading), so it's unlikely to be removed.
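As a rough illustration of the ordering constraint described above
(geolocate first, hash afterwards), a refine-style step could look something
like the sketch below. Everything here is hypothetical: geolocate is a
stand-in for a real geolocation lookup, the record fields are made up, and
the keyed hash (HMAC with a secret key) is an assumption added for the
example, since a plain SHA-512 of an IPv4 address can be reversed by hashing
all 2^32 possible addresses.

```python
import hashlib
import hmac

# Hypothetical secret key; a real deployment would manage this securely.
SECRET_KEY = b"example-key-not-for-production"


def hash_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-512 (HMAC) hex digest of an IP address string."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha512).hexdigest()


def geolocate(ip: str) -> str:
    """Hypothetical stand-in for a real geolocation lookup."""
    demo_table = {"198.51.100.7": "NL", "203.0.113.9": "US"}
    return demo_table.get(ip, "Unknown")


def refine(record: dict) -> dict:
    """Geolocate using the raw IP, then keep only the hash of the IP."""
    refined = {k: v for k, v in record.items() if k != "ip"}
    refined["country"] = geolocate(record["ip"])
    refined["ip_hash"] = hash_ip(record["ip"])
    return refined


example_refined = refine({"ip": "198.51.100.7", "uri": "/wiki/Main_Page"})
```

The point of the sketch is only that the hash has to be applied after
geolocation, because the country cannot be recovered from the hash alone.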
I don't have as clear a reason for why we store the plain IP in
webrequest. I think we could count uniques and all that other stuff with
the IP hash. It's a good question, tentative +1 unless I'm forgetting
something. But even so, it's not so bad, it's only stored for 60 days and
we have no other plain IPs anywhere else (like we removed them from Event
Logging for example).
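A small sketch of why counting uniques should still work on hashed values:
equal IPs always produce equal hashes, so the number of distinct hashes
matches the number of distinct IPs (barring collisions, which are not a
practical concern for SHA-512). The function names and sample addresses are
made up for the example.

```python
import hashlib


def sha512_hex(value: str) -> str:
    """Return the SHA-512 hex digest of a string."""
    return hashlib.sha512(value.encode("utf-8")).hexdigest()


def count_unique(values) -> int:
    """Count distinct values in an iterable."""
    return len(set(values))


# Hypothetical sample using documentation-range IP addresses.
raw_ips = ["198.51.100.7", "198.51.100.7", "203.0.113.9"]
ip_hashes = [sha512_hex(ip) for ip in raw_ips]

unique_by_ip = count_unique(raw_ips)
unique_by_hash = count_unique(ip_hashes)
```

Both counts come out the same here, so a uniques metric would not need the
plain IP once the hash is stored.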
On Tue, Nov 8, 2016 at 4:26 PM, James Salsman <jsalsman(a)gmail.com> wrote:
Are there any reasons to not replace HTTP GET request IP addresses and
proxy information with their SHA-512 secure hash prior to writing them
to permanent media?
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics