On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <rarohde@gmail.com> wrote:
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarling@wikimedia.org> wrote:
Peter Gervai wrote:
Is there a possibility to write code which processes raw squid data? Who do I have to bribe? :-/
Yes, it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers, with only anonymised data being passed on to the public.
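For illustration, a minimal sketch of such a filter in Python, assuming a space-separated log line with the request URL in a fixed field (the field index and the output format here are assumptions, not the real layout; check the documented format at the links below):

    #!/usr/bin/env python
    # Sketch of a log-stream aggregator: reads squid log lines on stdin,
    # counts hits per URL, and emits aggregate counts at the end.
    # URL_FIELD is an assumed position, not the actual log format.
    import sys
    from collections import defaultdict

    URL_FIELD = 8  # assumption: position of the request URL in the line

    counts = defaultdict(int)
    for line in sys.stdin:
        fields = line.split()
        if len(fields) <= URL_FIELD:
            continue  # skip malformed lines rather than crashing
        counts[fields[URL_FIELD]] += 1

    # Only aggregates leave the script; raw lines (with IPs) never do.
    for url, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%d %s" % (n, url))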
http://wikitech.wikimedia.org/view/Squid_logging
http://wikitech.wikimedia.org/view/Squid_log_format
How much of that is really considered private? IP addresses obviously, anything else?
I'm wondering if a cheap and dirty solution (at least for the low traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want.
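A naive version of that scrub might look like the sketch below (the IP field position is an assumption); as the reply that follows points out, blanking the IP alone is nowhere near sufficient:

    # Naive scrub sketch: blank the client IP field, pass the rest through.
    # Field position is an assumption. User agents, referrers, and even
    # request URLs can still identify people, as discussed below.
    import sys

    IP_FIELD = 4  # assumption: position of the client IP in the line

    for line in sys.stdin:
        fields = line.split()
        if len(fields) > IP_FIELD:
            fields[IP_FIELD] = '-'  # blank the IP address
        print(' '.join(fields))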
There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
Things which individually do not appear to disclose anything private can do so in combination (look at the people uniquely identified by AOL's 'anonymized' search data).
On the flip side, aggregation can take private things (e.g. user agents, IP info, referrers) and convert them to non-private data: top user agents, top referrers, highest-traffic ASNs... but it becomes potentially revealing if not done carefully: the 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation.
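A rough sketch of that kind of guarded aggregation, where the threshold value and the (user, value) input shape are illustrative assumptions: a bucket is only released if enough distinct users back it.

    # Sketch of "safe" top-N aggregation: release only values backed by
    # enough distinct users that no single user stands out.
    # MIN_DISTINCT_USERS and the input shape are assumptions.
    from collections import defaultdict

    MIN_DISTINCT_USERS = 10  # illustrative threshold, not a vetted value

    def top_values(records, n=25):
        # records: iterable of (user_id, value) pairs,
        # e.g. (hashed IP, user agent string)
        users_per_value = defaultdict(set)
        for user, value in records:
            users_per_value[value].add(user)
        # Drop values contributed by too few distinct users before ranking.
        safe = [(len(users), value) for value, users in users_per_value.items()
                if len(users) >= MIN_DISTINCT_USERS]
        return sorted(safe, reverse=True)[:n]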
Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, are limited to existing articles, and are limited either to really common paths or to paths broken into two- or three-node chains with the least common of those chains withheld.
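A sketch of the chain idea, with the minimum count chosen arbitrarily for illustration:

    # Sketch of path aggregation: break each user's click path into
    # overlapping two-node chains, count them, and release only chains
    # above a minimum frequency. The threshold is an assumption.
    from collections import defaultdict

    MIN_CHAIN_COUNT = 50  # illustrative: suppress rare, identifying chains

    def common_chains(paths):
        # paths: iterable of per-user article sequences, e.g. ['A', 'B', 'C']
        counts = defaultdict(int)
        for path in paths:
            for a, b in zip(path, path[1:]):
                counts[(a, b)] += 1
        return dict((chain, n) for chain, n in counts.items()
                    if n >= MIN_CHAIN_COUNT)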
Generally, when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows: treat all data as hostile, and assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.