On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell <gmaxwell@gmail.com> wrote:
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <rarohde@gmail.com> wrote:
There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
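To make that concrete, a minimal Python sketch of the scrubbing this implies, with made-up field names (and, as argued below, even this would not be sufficient on its own):

    from urllib.parse import urlsplit

    def scrub(record):
        # `record` is a hypothetical dict with keys 'url', 'ip',
        # 'user_agent', 'referrer'; real Squid log fields differ.
        scrubbed = {}
        # Keep only the path: query strings such as /wiki/search?lookup=...
        # can contain pasted private text.
        scrubbed['url'] = urlsplit(record['url']).path
        # Drop IPs, user agents, and referrers entirely rather than
        # truncating them; an odd toolbar combination or a one-off
        # referrer URL can be uniquely identifying on its own.
        return scrubbed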
Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data).
On the flip side, aggregation can take private things (e.g. user agents, IP info, referrers) and convert them to non-private data: top user agents, top referrers, highest-traffic ASNs... But it becomes potentially revealing if not done carefully: the 'top' network and user-agent info for a single obscure article in a short time window may be information from only one or two users, which is not really an aggregation.
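Roughly, that kind of guard might look like the following sketch; the field names and the threshold of 10 are assumptions, not a worked-out policy:

    from collections import Counter

    MIN_GROUP_SIZE = 10  # assumed threshold, not a vetted policy choice

    def top_values(records, field, n=25):
        # Count occurrences of a field (user agent, referrer, ASN, ...)
        # and suppress any value seen fewer than MIN_GROUP_SIZE times,
        # so a "top" list for an obscure article in a short window
        # cannot end up describing one or two people.
        counts = Counter(r[field] for r in records)
        return [(value, count) for value, count in counts.most_common(n)
                if count >= MIN_GROUP_SIZE]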
Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, are limited to existing articles, and are limited to either really common paths or paths broken into two- or three-node chains, with the least common of those chains withheld.
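A rough sketch of that chaining idea (the cutoff of 10 and the session format are assumptions):

    from collections import Counter

    MIN_CHAIN_COUNT = 10  # assumed cutoff for the least common chains

    def releasable_chains(sessions, length=3):
        # `sessions` is assumed to be an iterable of lists of article
        # titles, already restricted to existing articles.  Each path is
        # broken into overlapping chains of `length` nodes, and only
        # chains seen often enough to be a genuine aggregate are kept.
        chains = Counter()
        for path in sessions:
            for i in range(len(path) - length + 1):
                chains[tuple(path[i:i + length])] += 1
        return {chain: count for chain, count in chains.items()
                if count >= MIN_CHAIN_COUNT}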
Generally when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.
On reflection I agree with you, though I think the biggest problem would actually be a case you didn't mention. If one provided timing and page view information, then one could almost certainly single out individual users by correlating the view timing with edit histories.
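Roughly, the correlation I have in mind would go something like this toy sketch (timestamps in seconds; the 60-second window is an arbitrary assumption):

    def views_matching_edits(view_times, edit_times, window=60):
        # Flag page views that occur shortly before an edit by a given
        # account.  With per-page timing released, a handful of such
        # matches is usually enough to tie the "anonymous" view stream
        # to one editor.
        edit_times = sorted(edit_times)
        matches = []
        for t in view_times:
            if any(0 <= e - t <= window for e in edit_times):
                matches.append(t)
        return matches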
Okay, so no stripped logs. The next question becomes what is the right way to aggregate. We can either A) reinvent the wheel, or B) adapt a pre-existing log analyzer, configured to produce clean aggregate data. While I respect the work of Zachte and others, this might be a case where B is a better near-term solution.
Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that started this mess), his AWStats config already suppresses IP info and aggregates everything into groups from which it is very hard to identify anything personal. (There is still a small risk in allowing users to drill down to pages / requests that are almost never made, but perhaps that could be turned off.) AWStats has native support for Squid logs and is open source.
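For concreteness, the relevant knobs would live in awstats.conf, along these lines (directive names from memory, so they should be checked against the AWStats documentation):

    # awstats.conf sketch (hedged; verify against the AWStats docs)
    ShowHostsStats=0             # suppress the per-IP / per-host tables
    ShowAuthenticatedUsers=0     # suppress logged-in usernames
    # LogFormat=...              # Squid-specific format string, omitted here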
This is not necessarily the only option, but I suspect that if we gave it some thought it would be possible to find an off-the-shelf tool that would be good enough to support many wikis and configurable enough to satisfy even the GMaxwells of the world ;-). huwiki is actually the 20th largest wiki (by number of edits), so if it works for them, then a tool like AWStats can probably work for most of the projects (i.e. everything other than EN).
-Robert Rohde