On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <rarohde(a)gmail.com> wrote:
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarling(a)wikimedia.org> wrote:
Peter Gervai wrote:
Is there any possibility of writing code that
processes the raw squid data?
Who do I have to bribe? :-/
Yes, it's possible. You just need to write a script that accepts a log
stream on stdin and builds the aggregate data from it. If you want
access to IP addresses, it needs to run on our own servers with only
anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging
http://wikitech.wikimedia.org/view/Squid_log_format
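A minimal sketch of such a stdin filter, in Python. The field position of the URL is an assumption for illustration; the real layout is documented on the Squid_log_format page linked above. Note that the IP address is never stored, so only anonymised aggregates leave the script:

```python
import sys
from collections import Counter

URL_FIELD = 8  # hypothetical position of the request URL in the log line

def aggregate(stream):
    """Count page views per URL from a whitespace-delimited log stream."""
    counts = Counter()
    for line in stream:
        fields = line.split()
        if len(fields) <= URL_FIELD:
            continue  # skip malformed or truncated lines
        counts[fields[URL_FIELD]] += 1  # the IP field is simply discarded
    return counts

# Usage (assumed invocation): zcat access.log.gz | python aggregate.py
# for url, n in aggregate(sys.stdin).most_common(20): print(n, url)
```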
How much of that is really considered private? IP addresses
obviously, anything else?
I'm wondering if a cheap and dirty solution (at least for the low
traffic wikis) might be to write a script that simply scrubs the
private information and makes the rest available for whatever
applications people might want.
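A cheap-and-dirty scrubber might look like the sketch below (field names and the user-agent families are assumptions, not a vetted design). It drops the client IP entirely, truncates the URL at the query string, and keeps only a coarse browser family. The replies that follow show why even this is not obviously safe:

```python
import re

def scrub(record):
    """record: dict with hypothetical keys 'ip', 'url', 'user_agent'.
    Returns a reduced record with no IP, no query string, and only a
    coarse browser family instead of the full user-agent string."""
    url = record["url"].split("?", 1)[0]  # drop search terms, etc.
    m = re.search(r"(MSIE \d+|Firefox/\d+|Opera)", record["user_agent"])
    agent = m.group(1) if m else "Other"  # coarse family only
    return {"url": url, "user_agent": agent}  # note: 'ip' never copied
```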
There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0;
bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be
uniquely identifying). There is even private data in titles if you
don't sanitize carefully
(/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
There is private data in referrers
(http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
Things which individually do not appear to disclose anything private
can disclose private things (look at the people uniquely identified by
AOL's 'anonymized' search data).
On the flip side, aggregation can take private things (e.g.
user agents, IP info, referrers) and convert them into non-private
data: top user agents, top referrers, highest-traffic ASNs... but it
becomes potentially revealing if not done carefully: the 'top' network
and user-agent info for a single obscure article in a short time
window may be information from only one or two users, not really an
aggregation.
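One common guard against that failure mode is a minimum-support threshold: only publish a bucket if it is backed by enough distinct clients. A sketch, with the threshold K and the event shape being assumptions:

```python
from collections import defaultdict

K = 10  # hypothetical minimum number of distinct clients per bucket

def safe_top(events, n=5, k=K):
    """events: iterable of (bucket_value, client_id) pairs.
    Returns the top-n bucket values, but only those seen from at
    least k distinct clients, so a 'top user agent' for an obscure
    article cannot be traced back to one or two readers."""
    clients = defaultdict(set)
    for value, client in events:
        clients[value].add(client)  # distinct clients, not raw hits
    safe = {v: len(c) for v, c in clients.items() if len(c) >= k}
    return sorted(safe.items(), key=lambda kv: -kv[1])[:n]
```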
Things like common paths through the site should be safe so long as
they are not provided with too much temporal resolution, are limited
to existing articles, and are limited either to really common paths or
to two- or three-node chains, with the least common of those chains
withheld from release.
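The chain scheme above can be sketched as follows: break each full click path into overlapping two-node chains, count them, and release only those above a frequency cutoff (the cutoff value here is a made-up placeholder):

```python
from collections import Counter

MIN_COUNT = 50  # hypothetical release threshold

def releasable_chains(paths, min_count=MIN_COUNT):
    """paths: iterable of lists of article titles (existing articles
    only).  Reduces each path to its two-node chains and returns only
    the chains common enough to release."""
    chains = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):  # consecutive article pairs
            chains[(a, b)] += 1
    return {c: n for c, n in chains.items() if n >= min_count}
```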
Generally, when dealing with private data you must approach it with
the same attitude that a C coder must take to avoid buffer overflows:
treat all data as hostile, and assume all actions are potentially
dangerous. Try to figure out how to break it, and think deviously.