Is there a catalog of all data that could possibly be available (for
instance, the mw.session cookie), along with where it is logged, for
how long, and where in various toolchains it gets stripped out?
Not to my knowledge. For /readers/ specifically, we deliberately hold very
little.
Possible vectors I'm aware of:
-the mw.session cookie. This is stripped out of varnishlog before it even
gets to the analytics machines, so it presumably doesn't make it past udp2log.
-EventLogging data, for example data gathered to test how our caching or
module storage is working. We've got some of this for the time period I
analysed, and I'm planning on using the module storage data to test the
algorithm, since it contains a unique identifier independent of IP/UA. This
sort of information is gathered for specific tasks rather than by default,
which I'm kind of happy with: if the existing algorithm is valid I don't
really want to see more PII in our logs. If it isn't, eh, we'll assess how
important session data is outside of academia.
-the UA/IP/lang data (user agent, IP address, and language)
-...that's it.
Obviously these are "vectors I'm aware of" - I am fully open to being
corrected by someone more informed than myself.
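
To make the "test the algorithm against the module storage identifier" idea
concrete, here is a rough sketch of what such a check might look like. It
assumes (and this is only an assumption) that the algorithm under test treats
each distinct (IP, UA, lang) tuple as one client; the field names `ip`, `ua`,
`lang` and `token`, and the function name, are hypothetical and not taken from
any real schema. It counts how often one tuple covers several identifiers, and
how often one identifier shows up under several tuples.

```python
from collections import defaultdict


def compare_fingerprint_to_token(records):
    """Compare (ip, ua, lang) fingerprints against an independent identifier.

    `records` is an iterable of dicts with the hypothetical keys
    "ip", "ua", "lang" and "token".
    """
    tokens_per_fp = defaultdict(set)   # fingerprint -> identifiers seen under it
    fps_per_token = defaultdict(set)   # identifier -> fingerprints it appeared with

    for r in records:
        fp = (r["ip"], r["ua"], r["lang"])
        tokens_per_fp[fp].add(r["token"])
        fps_per_token[r["token"]].add(fp)

    # Fingerprints that lump several identifiers together.
    merged = sum(1 for toks in tokens_per_fp.values() if len(toks) > 1)
    # Identifiers that are spread over several fingerprints.
    split = sum(1 for fps in fps_per_token.values() if len(fps) > 1)

    return {
        "fingerprints": len(tokens_per_fp),
        "tokens": len(fps_per_token),
        "merged_fingerprints": merged,
        "split_tokens": split,
    }


# Made-up sample rows using documentation-reserved IP ranges.
sample = [
    {"ip": "192.0.2.1", "ua": "UA-A", "lang": "en", "token": "t1"},
    {"ip": "192.0.2.1", "ua": "UA-A", "lang": "en", "token": "t2"},
    {"ip": "192.0.2.7", "ua": "UA-B", "lang": "de", "token": "t3"},
    {"ip": "198.51.100.4", "ua": "UA-B", "lang": "de", "token": "t3"},
]
result = compare_fingerprint_to_token(sample)
print(result)
```

If both the merged and split counts stay low on real data, that would suggest
the tuple is a reasonable stand-in for the identifier; what counts as "low
enough" is a judgment call this sketch doesn't try to make.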
Related lists could be useful for planning:
* Limitations our privacy policies place on data gathering (handy when
reviewing those policies)
Indeed; the analytics team is working out how we address data retention as
we speak.
* Studies that are easy and hard given the types of
data we gather
* Wishlists (from external researchers, and from internal staff) of
data-sets that would be useful but aren't currently available. Along
with a sense of priority, complexity, cost.
Yep, these thought experiments are being factored into our data retention
discussion.