[Wikimedia-l] PRISM

Tue Jun 11 04:46:02 UTC 2013

On 11/06/13 10:41, Anthony wrote:
> One thing I'd also appreciate is that if indeed Wikipedia access logs are
> not even collected in the first place (except for 1/1000 samples), that
> this be stated officially, rather than relying on a two-year-old comment by
> a single, now-former employee.

In October 2012, I introduced an unsampled log of API requests,
including IP addresses. This was in response to a server overload
caused by API traffic, which was very difficult to isolate because we
had no meaningful logs. The retention time is currently 30 days.

This means that, among other things, search autocomplete is logged.

The logs are collected at the backend, which means that Squid cache
hits will not be logged. So autocomplete requests for common terms and
prefixes will appear rarely.
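
As a rough illustration of why backend-side logging misses cache hits,
here is a toy model in Python. It is only a sketch, not the actual
Squid or MediaWiki configuration: the class names, the cache-by-URL
policy, the example addresses and the query strings are all assumptions
made for the example.

```python
class Backend:
    """Toy application server that logs every request it receives."""

    def __init__(self):
        self.log = []  # (ip, url) pairs, standing in for the API request log

    def handle(self, ip, url):
        self.log.append((ip, url))
        return f"response for {url}"


class CacheFrontend:
    """Toy caching proxy: answers repeat URLs without contacting the backend."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def handle(self, ip, url):
        if url in self.cache:
            # Cache hit: the backend never sees this request, so it is not logged.
            return self.cache[url]
        # Cache miss: forwarded to the backend, which logs it.
        body = self.backend.handle(ip, url)
        self.cache[url] = body
        return body


def demo():
    backend = Backend()
    frontend = CacheFrontend(backend)
    common = "/w/api.php?action=opensearch&search=wiki"       # hypothetical common prefix
    rare = "/w/api.php?action=opensearch&search=wikimedia-l"  # hypothetical rare prefix
    requests = [
        ("192.0.2.1", common),  # miss, logged
        ("192.0.2.2", common),  # hit, not logged
        ("192.0.2.3", rare),    # miss, logged
        ("192.0.2.1", common),  # hit, not logged
    ]
    for ip, url in requests:
        frontend.handle(ip, url)
    return backend.log


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

In this model four requests arrive but only two reach the backend log,
and the common prefix appears once no matter how many clients ask for it.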

This is not a secret -- the changes that made it happen were public at
the time:

https://gerrit.wikimedia.org/r/#/c/24274/
https://gerrit.wikimedia.org/r/#/c/26434/

I'm sure that the other teams (e.g. fundraising, mobile and analytics)
can give you details of what access logs they collect and store.

In general, access logs haven't been stored because of the cost, not
for any privacy reason. Lots of smaller services (e.g.
blog.wikimedia.org) do store access logs.

-- Tim Starling