[Wikitech-l] Wikimedia's anti-surveillance plans

16 Aug 2013

Hi, I'm a grad student at CMU studying network security in general and 
censorship / surveillance resistance in particular. I also used to work 
for Mozilla; some of you may remember me in that capacity. My friend 
Sumana Harihareswara asked me to comment on Wikimedia's plans for 
hardening the encyclopedia against state surveillance. I've read all of 
the discussion to date on this subject, but it was kinda all over the 
map, so I thought it would be better to start a new thread.

I understand that there is specific interest in making it hard for an 
eavesdropper to identify *which pages* are being read or edited. I'd 
first like to suggest that there are probably dozens of other things a 
traffic-analytic attacker could learn and make use of, such as:

  * Given an IP address known to be communicating with WP/WM, whether
    or not there is a logged-in user responsible for the traffic.
  * Assuming it is known that a logged-in user is responsible for some
    traffic, *which user it is* (User: handle) or whether the user has
    any special privileges.
  * State transitions between uncredentialed and logged-in (in either
    direction).
  * State transitions between reading and editing.

This is unlikely to be an exhaustive list. If we are serious about 
defending against traffic analysis, one of the first things we should do 
is have a bunch of experienced editors and developers sit down and work 
out an exhaustive list of things we don't want to reveal. (I have only 
ever dabbled in editing Wikipedia.)

---

Now, to technical measures. The roadmap at [URL] looks to me to have the 
right shape, but there are some missing things and points of confusion.

The very first step really must be to enable HTTPS unconditionally for 
everyone (whether or not logged in). I saw a couple of people mention 
that this would lock some user groups out of the encyclopedia -- can 
anyone expand on that a little? We're going to have to find a workaround 
for that. If the server ever emits cleartext, the game is over. You 
should probably think about doing SPDY, or whatever they're calling it 
these days, at the same time; it's valuable not only for traffic 
analysis' sake, but because it offers server-side efficiency gains that 
(in theory) should mitigate the overhead of doing TLS for everyone.

After that's done, there's a grab bag of additional security refinements 
that are deployable now or with minimal-to-moderate engineering effort. 
The roadmap mentions Strict Transport Security; that should definitely 
happen. You should also do Content-Security-Policy, as strict as 
possible. I know this can be a huge amount of development effort, but 
the benefits are equally huge - we don't know exactly how it was done, 
but there's an excellent chance CSP on the hidden service would have 
prevented the exploit that got us all talking about this. Certificate 
pinning (possible either via HSTS extensions, or via talking to browser 
vendors and getting them to bake your certificate in) should at least 
cut down on the risk of a compromised CA. Deploying DNSSEC and DANE will 
also help with that. (Nobody consumes DANE information yet, but if you 
make the first move, things might happen very fast on the client side; 
also, if you discover that you can't reasonably deploy DANE, the IETF 
needs to know about it [I would rate it as moderately likely that DANE 
is broken-as-specified].)
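
As a rough illustration of the two header-based measures above (Strict 
Transport Security and Content-Security-Policy), here is a minimal 
Python sketch of the response headers a server might emit. The helper 
name `security_headers`, the one-year max-age, and the exact CSP 
directives are assumptions for illustration only; a real policy for 
MediaWiki would need to be worked out against its actual scripts and 
styles.

```python
# Hypothetical helper: the values below are illustrative assumptions,
# not Wikimedia's actual configuration.

def security_headers(max_age_seconds=31536000):
    """Return a dict of HSTS and CSP response headers."""
    return {
        # HSTS: ask the browser to use HTTPS only for this host (and its
        # subdomains) for the next max_age_seconds seconds.
        "Strict-Transport-Security":
            f"max-age={max_age_seconds}; includeSubDomains",
        # CSP: by default load resources only from the page's own origin,
        # and disallow plugin content entirely.
        "Content-Security-Policy":
            "default-src 'self'; object-src 'none'",
    }


if __name__ == "__main__":
    for name, value in security_headers().items():
        print(f"{name}: {value}")
```

Both headers only take effect when delivered over HTTPS in the first 
place, which is one more reason the unconditional-HTTPS step has to come 
first.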

Perfect forward secrecy should also be considered at this stage. Folks 
seem to be confused about what PFS is good for. It is *complementary* to 
traffic analysis resistance, but it's not useless in its absence. 
What it does is provide defense in depth against a server compromise by 
a well-heeled entity who has been logging traffic *contents*. If you 
don't have PFS and the server is compromised, *all* traffic going back 
potentially for years is decryptable, including cleartext passwords and 
other equally valuable info. If you do have PFS, the exposure is limited 
to the session rollover interval.

You should also consider aggressively paring back the set of 
ciphersuites offered by your servers. [...]
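
To make the last two points concrete, here is a small sketch using 
Python's standard `ssl` module that restricts a server-side context to 
ciphersuites with ephemeral ECDH key exchange (which is what gives you 
forward secrecy) and AES-GCM. The cipher string "ECDHE+AESGCM" is just 
one plausible choice, not a recommendation from this thread, and the 
function names are hypothetical. A real server would also need to load 
its certificate and key with `load_cert_chain`.

```python
import ssl


def make_server_context():
    """Build a server-side TLS context offering only ECDHE + AES-GCM suites.

    Assumption: "ECDHE+AESGCM" is an illustrative OpenSSL cipher string.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Keep only suites that use ephemeral elliptic-curve Diffie-Hellman
    # key exchange together with AES in GCM mode.
    ctx.set_ciphers("ECDHE+AESGCM")
    return ctx


def cipher_names(ctx):
    """List the names of the ciphersuites the context will offer."""
    return [c["name"] for c in ctx.get_ciphers()]


if __name__ == "__main__":
    for name in cipher_names(make_server_context()):
        print(name)
```

Depending on the OpenSSL build, the list may also include suites for 
newer TLS versions (names starting with "TLS_"), which `set_ciphers` 
does not control.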

And finally, I realize how disruptive this is, but you need to change 
all the URLs so that the hostname does not expose the language tag. 
Server hostnames are cleartext even with HTTPS and SPDY (because they're 
the subject of DNS lookups, and because they are sent both ways in the 
clear as part of the TLS handshake); so even with ubiquitous encryption, 
an eavesdropper can tell which language-specific encyclopedia is being 
read, and that might be enough to finger someone. My suggested bikeshed 
color would be https://wikipedia.org/LANGUAGE/PAGENAME (i.e. replace 
/wiki/ with the language tag). It is probably not necessary to do this 
for Commons, but 
it *is* necessary for metawikis (knowing whether a given IP address ever 
even looks at a metawiki may reveal something important).
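
The URL change suggested above can be sketched as a simple mapping. 
This is only an illustration of the proposed scheme: the function name 
`to_language_path` is hypothetical, it handles only hosts of the form 
LANGUAGE.wikipedia.org with /wiki/ paths, and it leaves everything else 
(including metawikis, which would need their own rule) untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def to_language_path(url):
    """Map https://LANGUAGE.wikipedia.org/wiki/PAGENAME to
    https://wikipedia.org/LANGUAGE/PAGENAME.

    URLs that do not match that pattern are returned unchanged.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Expect exactly LANGUAGE.wikipedia.org; "www" is not a language tag.
    if len(labels) != 3 or labels[1:] != ["wikipedia", "org"]:
        return url
    if labels[0] == "www" or not parts.path.startswith("/wiki/"):
        return url
    language = labels[0]
    page = parts.path[len("/wiki/"):]
    # The language tag moves from the (cleartext) hostname into the
    # (encrypted) path.
    return urlunsplit(
        ("https", "wikipedia.org", f"/{language}/{page}",
         parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_language_path("https://en.wikipedia.org/wiki/Traffic_analysis"))
```

With this scheme every request goes to the same hostname, so DNS 
lookups and the TLS handshake no longer reveal the language.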

---

Once *all of* those things have been done, we could start thinking about 
traffic analysis resistance. I should be clear that this is an active 
research field. Theoretically, yes, what you do is pad. In practice, we 
don't know how much padding is required. I want to address two repeated 
errors from earlier: The padding inherent in TLS block cipher modes is 
"round up to the nearest multiple of 16 bytes", which has been shown to 
be woefully inadequate. It is *theoretically* possible to make TLS 
over-pad, up to a multiple of 256 bytes, but that is still inadequate, 
and no off-the-shelf implementation bothers.
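
A little arithmetic shows why rounding up to a small block size hides 
so little. The sketch below uses a deliberately simplified model (it 
ignores the MAC, record framing, and the padding-length byte) and a 
handful of made-up response sizes; none of the numbers are measurements 
of real Wikipedia pages.

```python
def padded_length(size, block):
    """Round size up to the next multiple of block (simplified model)."""
    return ((size + block - 1) // block) * block


def distinguishable(sizes, block):
    """Count the distinct padded lengths an eavesdropper would observe."""
    return len({padded_length(s, block) for s in sizes})


# Made-up response sizes in bytes, for illustration only.
EXAMPLE_SIZES = [14200, 14900, 31000, 31100, 88000]

if __name__ == "__main__":
    for block in (16, 256):
        print(block, distinguishable(EXAMPLE_SIZES, block))
```

With these example sizes, 16-byte rounding leaves all five responses 
distinguishable, and 256-byte rounding merges only the two that were 
already within 100 bytes of each other.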
