On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <rarohde(a)gmail.com> wrote:
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarling(a)wikimedia.org> wrote:
Peter Gervai wrote:
Is there any possibility of writing code that
processes the raw squid data?
Who do I have to bribe? :-/
Yes, it's possible. You just need to write a script that accepts a log
stream on stdin and builds the aggregate data from it. If you want
access to IP addresses, it needs to run on our own servers with only
anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging
http://wikitech.wikimedia.org/view/Squid_log_format
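A minimal sketch of such a stdin filter, in Python. The field position of the URL is an assumption for illustration; the real layout is documented on the Squid_log_format page linked above. Note that the IP address is never stored, so only anonymised aggregates leave the script:

```python
import sys
from collections import Counter

URL_FIELD = 8  # hypothetical position of the request URL in the log line

def aggregate(stream):
    """Count page views per URL from a whitespace-delimited log stream."""
    counts = Counter()
    for line in stream:
        fields = line.split()
        if len(fields) <= URL_FIELD:
            continue  # skip malformed or truncated lines
        counts[fields[URL_FIELD]] += 1  # the IP field is simply discarded
    return counts

# Usage (assumed invocation): zcat access.log.gz | python aggregate.py
# for url, n in aggregate(sys.stdin).most_common(20): print(n, url)
```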
How much of that is really considered private? IP addresses
obviously, anything else?
I'm wondering if a cheap and dirty solution (at least for the low
traffic wikis) might be to write a script that simply scrubs the
private information and makes the rest available for whatever
applications people might want.
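A cheap-and-dirty scrubber might look like the sketch below (field names and the user-agent families are assumptions, not a vetted design). It drops the client IP entirely, truncates the URL at the query string, and keeps only a coarse browser family. The replies that follow show why even this is not obviously safe:

```python
import re

def scrub(record):
    """record: dict with hypothetical keys 'ip', 'url', 'user_agent'.
    Returns a reduced record with no IP, no query string, and only a
    coarse browser family instead of the full user-agent string."""
    url = record["url"].split("?", 1)[0]  # drop search terms, etc.
    m = re.search(r"(MSIE \d+|Firefox/\d+|Opera)", record["user_agent"])
    agent = m.group(1) if m else "Other"  # coarse family only
    return {"url": url, "user_agent": agent}  # note: 'ip' never copied
```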
There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0;
bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be
uniquely identifying). There is even private data in titles if you
don't sanitize carefully
(/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
There is private data in referrers
(http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
Things which individually do not appear to disclose anything private
can disclose private things (look at the people uniquely identified by
AOL's 'anonymized' search data).
On the flip side, aggregation can take private things (e.g.
user agents, IP info, referrers) and convert them into non-private
data: top user agents, top referrers, highest-traffic ASNs... but it
becomes potentially revealing if not done carefully: the 'top' network
and user-agent info for a single obscure article in a short time
window may be information from only one or two users, not really an
aggregation.
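One common guard against that failure mode is a minimum-support threshold: only publish a bucket if it is backed by enough distinct clients. A sketch, with the threshold K and the event shape being assumptions:

```python
from collections import defaultdict

K = 10  # hypothetical minimum number of distinct clients per bucket

def safe_top(events, n=5, k=K):
    """events: iterable of (bucket_value, client_id) pairs.
    Returns the top-n bucket values, but only those seen from at
    least k distinct clients, so a 'top user agent' for an obscure
    article cannot be traced back to one or two readers."""
    clients = defaultdict(set)
    for value, client in events:
        clients[value].add(client)  # distinct clients, not raw hits
    safe = {v: len(c) for v, c in clients.items() if len(c) >= k}
    return sorted(safe.items(), key=lambda kv: -kv[1])[:n]
```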
Things like common paths through the site should be safe so long as
they are not provided with too much temporal resolution, are limited
to existing articles, and are limited either to really common paths or
to two- or three-node chains, with the least common of those chains
withheld from release.
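The chain scheme above can be sketched as follows: break each full click path into overlapping two-node chains, count them, and release only those above a frequency cutoff (the cutoff value here is a made-up placeholder):

```python
from collections import Counter

MIN_COUNT = 50  # hypothetical release threshold

def releasable_chains(paths, min_count=MIN_COUNT):
    """paths: iterable of lists of article titles (existing articles
    only).  Reduces each path to its two-node chains and returns only
    the chains common enough to release."""
    chains = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):  # consecutive article pairs
            chains[(a, b)] += 1
    return {c: n for c, n in chains.items() if n >= min_count}
```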
Generally, when dealing with private data you must approach it with
the same attitude that a C coder must take to avoid buffer overflows:
treat all data as hostile, and assume all actions are potentially
dangerous. Try to figure out how to break it, and think deviously.