Hi, I'm a grad student at CMU studying network security in general and censorship / surveillance resistance in particular. I also used to work for Mozilla; some of you may remember me in that capacity. My friend Sumana Harihareswara asked me to comment on Wikimedia's plans for hardening the encyclopedia against state surveillance. I've read all of the discussion to date on this subject, but it was kinda all over the map, so I thought it would be better to start a new thread.
I understand that there is specific interest in making it hard for an eavesdropper to identify *which pages* are being read or edited. I'd first like to suggest that there are probably dozens of other things a traffic-analytic attacker could learn and make use of, such as:
* Given an IP address known to be communicating with WP/WM, whether or not there is a logged-in user responsible for the traffic.
* Assuming it is known that a logged-in user is responsible for some traffic, *which user it is* (User: handle) or whether the user has any special privileges.
* State transitions between uncredentialed and logged-in (in either direction).
* State transitions between reading and editing.
This is unlikely to be an exhaustive list. If we are serious about defending against traffic analysis, one of the first things we should do is get a bunch of experienced editors and developers to sit down and work out an exhaustive list of things we don't want to reveal. (I have only ever dabbled in editing Wikipedia.)
---
Now, to technical measures. The roadmap at [URL] looks to me to have the right overall shape, but some things are missing and a few points are confused.
The very first step really must be to enable HTTPS unconditionally for everyone (whether or not logged in). I saw a couple of people mention that this would lock some user groups out of the encyclopedia -- can anyone expand on that a little? We're going to have to find a workaround for that. If the server ever emits cleartext, the game is over. You should probably think about doing SPDY, or whatever they're calling it these days, at the same time; it's valuable not only for traffic analysis' sake, but because it offers server-side efficiency gains that (in theory) should mitigate the overhead of doing TLS for everyone.
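For concreteness, here's a quick way to check the "no cleartext ever" property from the outside. This is just a sketch using the Python standard library; the hostname and path are examples, and it only probes one URL rather than every entry point.

    import http.client

    def check_https_redirect(host, path="/wiki/Main_Page"):
        """Fetch a page over cleartext HTTP and report what comes back.
        The goal state: every plain-HTTP request is answered with an
        immediate redirect to an https:// URL, never with content."""
        conn = http.client.HTTPConnection(host, 80, timeout=10)
        conn.request("GET", path)
        resp = conn.getresponse()
        location = resp.getheader("Location", "")
        body = resp.read()
        conn.close()
        if resp.status in (301, 302, 307, 308) and location.startswith("https://"):
            print(f"{host}: OK, {resp.status} redirect to {location}")
        else:
            print(f"{host}: NOT OK, status {resp.status}, "
                  f"{len(body)} bytes served in the clear")

    if __name__ == "__main__":
        check_https_redirect("en.wikipedia.org")  # example hostname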
After that's done, there's a grab bag of additional security refinements that are deployable now or with minimal-to-moderate engineering effort. The roadmap mentions Strict Transport Security; that should definitely happen. You should also do Content-Security-Policy, as strict as possible. I know this can be a huge amount of development effort, but the benefits are equally huge: we don't know exactly how it was done, but there's an excellent chance that CSP on the hidden service would have prevented the exploit that got us all talking about this. Certificate pinning (possible either via the HSTS extensions, or by talking to browser vendors and getting them to bake your certificate in) should at least cut down on the risk of a compromised CA. Deploying DNSSEC and DANE will also help with that. (Nobody consumes DANE information yet, but if you make the first move, things might happen very fast on the client side; also, if you discover that you can't reasonably deploy DANE, the IETF needs to know about it [I would rate it as moderately likely that DANE is broken-as-specified].)
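To make the header-based items in that list concrete, here's another small sketch (Python standard library again, hostname is an example) that fetches a page over HTTPS and reports whether the Strict-Transport-Security and Content-Security-Policy headers are being sent at all. The header *values* are policy decisions I'm not trying to prescribe here.

    import http.client

    # Response headers discussed above; their values are policy decisions,
    # this only checks that they are present at all.
    WANTED = ("Strict-Transport-Security", "Content-Security-Policy")

    def check_security_headers(host, path="/wiki/Main_Page"):
        conn = http.client.HTTPSConnection(host, 443, timeout=10)
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        for name in WANTED:
            value = resp.getheader(name)
            print(f"{name}: {'MISSING' if value is None else value}")
        conn.close()

    if __name__ == "__main__":
        check_security_headers("en.wikipedia.org")  # example hostname

(Pinning that's done by baking certificates into browsers obviously won't show up in a check like this, and neither will DNSSEC/DANE; those have to be verified at the DNS layer.)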
Perfect forward secrecy should also be considered at this stage. Folks seem to be confused about what PFS is good for. It is *complementary* to traffic analysis resistance, but it's not useless in the absence of it. What it does is provide defense in depth against a server compromise by a well-heeled entity who has been logging traffic *contents*. If you don't have PFS and the server is compromised, *all* recorded traffic, going back potentially for years, is decryptable, including cleartext passwords and other equally valuable info. If you do have PFS, the exposure is limited to the session key rollover interval.
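Whether you're actually getting forward secrecy is easy to observe from the outside, because it's determined by the key exchange in the negotiated ciphersuite. Here's a sketch, with the same caveats as above, that connects and reports which suite the server picks:

    import socket
    import ssl

    def negotiated_cipher(host, port=443):
        """Connect and report the ciphersuite the server actually negotiates.
        Suites whose names start with ECDHE or DHE use ephemeral key
        exchange, so a later compromise of the server's long-term key does
        not decrypt previously recorded traffic."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                name, version, bits = tls.cipher()
        # (Every TLS 1.3 suite is ephemeral; those names drop the prefix.)
        fs = version == "TLSv1.3" or name.startswith(("ECDHE", "DHE"))
        print(f"{host}: {name} ({version}, {bits}-bit) "
              f"{'forward-secret' if fs else 'NOT forward-secret'}")

    if __name__ == "__main__":
        negotiated_cipher("en.wikipedia.org")  # example hostname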
You should also consider aggressively paring back the set of ciphersuites offered by your servers. [...]
And finally, I realize how disruptive this is, but you need to change all the URLs so that the hostname does not expose the language tag. Server hostnames are cleartext even with HTTPS and SPDY (because they're the subject of DNS lookups, and because they are sent both ways in the clear as part of the TLS handshake); so even with ubiquitous encryption, an eavesdropper can tell which language-specific encyclopedia is being read, and that might be enough to finger someone. My suggested bikeshed color would be https://wikipedia.org/LANGUAGE/PAGENAME (i.e. replace /wiki/ with the language tag). It is probably not necessary to do this for Commons, but it *is* necessary for metawikis (knowing whether a given IP address ever even looks at a metawiki may reveal something important).
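To make the shape of that proposal concrete, here's a toy rewrite function. The regex and the target form are just my reading of the suggestion above (article URLs on language subdomains only); it's an illustration, not a migration plan that handles every project and namespace.

    import re

    # Map current language-subdomain article URLs onto the proposed
    # path-based scheme, so the hostname -- which is visible in DNS
    # lookups and in the TLS handshake -- no longer reveals which
    # language edition is being read.
    _OLD_STYLE = re.compile(r"^https://([a-z-]+)\.wikipedia\.org/wiki/(.+)$")

    def proposed_url(old_url):
        m = _OLD_STYLE.match(old_url)
        if m is None:
            return old_url  # not a language-subdomain article URL; leave it alone
        language, pagename = m.groups()
        return f"https://wikipedia.org/{language}/{pagename}"

    if __name__ == "__main__":
        print(proposed_url("https://en.wikipedia.org/wiki/Special:Watchlist"))
        # -> https://wikipedia.org/en/Special:Watchlist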
---
Once *all of* those things have been done, we could start thinking about traffic analysis resistance. I should be clear that this is an active research field. Theoretically, yes, what you do is pad. In practice, we don't know how much padding is required. I want to address two errors that have come up repeatedly in this thread: first, the padding inherent in TLS block cipher modes is only "round up to the nearest multiple of 16 bytes", which has been shown to be woefully inadequate. Second, it is *theoretically* possible to make TLS over-pad, up to a multiple of 256 bytes, but that is still inadequate, and no off-the-shelf implementation bothers.
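To see why rounding record sizes to small fixed boundaries doesn't help much, here's a toy calculation; the page sizes are made-up numbers purely for illustration.

    def padded_length(plaintext_len, boundary):
        """Length after rounding up to the next multiple of `boundary`,
        roughly what block-cipher padding does to a TLS record (the MAC
        and record header add constant overhead, which hides nothing)."""
        return -(-plaintext_len // boundary) * boundary  # ceiling division

    # Hypothetical compressed response sizes, in bytes, for three articles.
    pages = {"article A": 23154, "article B": 23870, "article C": 61112}

    for name, size in pages.items():
        print(f"{name}: {size} -> x16: {padded_length(size, 16)}"
              f" -> x256: {padded_length(size, 256)}")
    # Distinct sizes stay distinct after either rounding: an eavesdropper
    # who knows how big each article is can still tell which one was fetched.

Padding that actually hides which page was fetched has to make many different pages look the same on the wire, and how to do that at an acceptable bandwidth cost is exactly the open question.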