Scrubbing log files to make the data private is hard work. You'd be
impressed by what researchers have been able to do - taking purportedly
anonymous data and identifying users en masse by correlating it with
publicly available data from other sites such as Amazon, Facebook and
Netflix. Make no mistake - if you don't do it carefully, you will become
the target of, in the best of cases, an academic researcher who wants to
prove that you don't understand statistics.
On Fri, Jun 5, 2009 at 8:13 PM, Robert Rohde <rarohde(a)gmail.com> wrote:
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarling(a)wikimedia.org> wrote:
Peter Gervai wrote:
Is there a possibility to write code which processes the raw Squid data?
Who do I have to bribe? :-/
Yes, it's possible. You just need to write a script that accepts a log
stream on stdin and builds the aggregate data from it. If you want
access to IP addresses, it needs to run on our own servers, with only
anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging
http://wikitech.wikimedia.org/view/Squid_log_format
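The stdin-driven aggregator Tim describes can be sketched as a small filter. The field position used here is an assumption for illustration (the actual column layout is documented on the Squid_log_format page above); the point is that only aggregate counts, never client IPs, leave the script.

```python
import sys
from collections import Counter

def aggregate(stream, url_field=8):
    """Count requests per URL from a whitespace-separated log stream.

    url_field=8 is an illustrative assumption; consult the
    Squid_log_format documentation for the real column layout.
    Lines too short to contain the field are skipped.
    """
    counts = Counter()
    for line in stream:
        fields = line.split()
        if len(fields) > url_field:
            counts[fields[url_field]] += 1
    return counts

if __name__ == "__main__" and not sys.stdin.isatty():
    # Emit "count<TAB>url", most requested first; no IPs are printed.
    for url, n in aggregate(sys.stdin).most_common():
        print(f"{n}\t{url}")
```

Because the script only reads stdin and writes aggregates, it can run on the servers that see raw logs while publishing nothing sensitive downstream.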
How much of that is really considered private? IP addresses
obviously, anything else?
I'm wondering if a cheap and dirty solution (at least for the low
traffic wikis) might be to write a script that simply scrubs the
private information and makes the rest available for whatever
applications people might want.
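A scrubbing filter of the kind suggested above might look like the sketch below. The field index and salt are illustrative assumptions, and - as the start of this thread warns - a bare hash is NOT sufficient anonymisation: the IPv4 space is small enough to brute-force, so any real deployment would need a strong, regularly rotated secret salt at minimum.

```python
import hashlib
import sys

def scrub(line, ip_field=4, salt=b"rotate-this-secret-daily"):
    """Replace the client-IP field with a truncated salted hash.

    ip_field=4 and the salt are illustrative assumptions; check the
    Squid_log_format page for where the IP actually appears. Hashing
    alone is reversible by brute force without a strong rotated salt.
    """
    fields = line.rstrip("\n").split(" ")
    if len(fields) > ip_field:
        digest = hashlib.sha256(salt + fields[ip_field].encode()).hexdigest()
        fields[ip_field] = digest[:12]  # short token, still collision-light
    return " ".join(fields)

if __name__ == "__main__" and not sys.stdin.isatty():
    for raw in sys.stdin:
        print(scrub(raw))
```

Keeping a per-day pseudonym (rather than dropping the field outright) preserves the ability to count distinct visitors in aggregate, which is often what downstream statistics applications want.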
-Robert Rohde
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l