Hi, I'm a grad student at CMU studying network security in general and
censorship / surveillance resistance in particular. I also used to work
for Mozilla; some of you may remember me in that capacity. My friend
Sumana Harihareswara asked me to comment on Wikimedia's plans for
hardening the encyclopedia against state surveillance. I've read all of
the discussion to date on this subject, but it was kinda all over the
map, so I thought it would be better to start a new thread.
I understand that there is specific interest in making it hard for an
eavesdropper to identify *which pages* are being read or edited. I'd
first like to suggest that there are probably dozens of other things a
traffic-analytic attacker could learn and make use of, such as:
* Given an IP address known to be communicating with WP/WM, whether
or not there is a logged-in user responsible for the traffic.
* Assuming it is known that a logged-in user is responsible for some
traffic, *which user it is* (User: handle) or whether the user has
any special privileges.
* State transitions between uncredentialed and logged-in (in either direction).
* State transitions between reading and editing.
This is unlikely to be an exhaustive list. If we are serious about
defending against traffic analysis, one of the first things we should do
is have a bunch of experienced editors and developers sit down and work
out an exhaustive list of things we don't want to reveal. (I have only
ever dabbled in editing Wikipedia.)
Now, to technical measures. The roadmap at [URL] looks to me to have the
right shape, but there are some missing things and points of confusion.
The very first step really must be to enable HTTPS unconditionally for
everyone (whether or not logged in). I saw a couple of people mention
that this would lock some user groups out of the encyclopedia -- can
anyone expand on that a little? We're going to have to find a workaround
for that. If the server ever emits cleartext, the game is over. You
should probably think about doing SPDY, or whatever they're calling it
these days, at the same time; it's valuable not only for traffic
analysis' sake, but because it offers server-side efficiency gains that
(in theory) should mitigate the overhead of doing TLS for everyone.
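To make "never emit cleartext" concrete, the entire port-80 side of the service can be reduced to a bare redirect. Here's a sketch as a tiny Python WSGI app -- hypothetical, obviously not Wikimedia's actual stack:

```python
# Minimal sketch: unconditionally bounce any plain-HTTP request to HTTPS.
# Hypothetical WSGI app, not Wikimedia's production configuration.

def redirect_to_https(environ, start_response):
    host = environ.get('HTTP_HOST', 'en.wikipedia.org')  # example host
    location = 'https://' + host + environ.get('PATH_INFO', '/')
    if environ.get('QUERY_STRING'):
        location += '?' + environ['QUERY_STRING']
    # 301 so clients cache the redirect; the cleartext response carries
    # nothing but the Location header.
    start_response('301 Moved Permanently',
                   [('Location', location), ('Content-Length', '0')])
    return [b'']
```

The point is that this redirect is the *only* thing the server ever says in the clear; all actual content goes over TLS.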
After that's done, there's a grab bag of additional security refinements
that are deployable now or with minimal-to-moderate engineering effort.
The roadmap mentions Strict Transport Security; that should definitely
happen. You should also do Content-Security-Policy, as strict as
possible. I know this can be a huge amount of development effort, but
the benefits are equally huge -- we don't know exactly how it was done,
but there's an excellent chance CSP on the hidden service would have
prevented the exploit that got us all talking about this. Certificate
pinning (possible either via HSTS extensions, or via talking to browser
vendors and getting them to bake your certificate in) should at least
cut down on the risk of a compromised CA. Deploying DNSSEC and DANE will
also help with that. (Nobody consumes DANE information yet, but if you
make the first move, things might happen very fast on the client side;
also, if you discover that you can't reasonably deploy DANE, the IETF
needs to know about it [I would rate it as moderately likely that DANE
will turn out to be undeployable in practice].)
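For concreteness, here are the two headers I mentioned, with illustrative values -- the specific max-age and policy directives are placeholders I made up, not a recommendation for production:

```python
# Illustrative security headers; the values below are placeholders,
# not a tuned policy for Wikimedia.

security_headers = {
    # HSTS: after first contact, browsers refuse cleartext for this
    # host (and its subdomains) for the next year.
    'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
    # CSP: as strict as the site can tolerate -- no inline script,
    # nothing loaded from third-party origins.
    'Content-Security-Policy': "default-src 'self'; script-src 'self'",
}
```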
Perfect forward secrecy should also be considered at this stage. Folks
seem to be confused about what PFS is good for. It is *complementary* to
traffic analysis resistance, but it's not useless in the absence of it.
What it does is provide defense in depth against a server compromise by
a well-heeled entity who has been logging traffic *contents*. If you
don't have PFS and the server is compromised, *all* traffic going back
potentially for years is decryptable, including cleartext passwords and
other equally valuable info. If you do have PFS, the exposure is limited
to the session rollover interval.
You should also consider aggressively paring back the set of
ciphersuites offered by your servers. [...]
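To show what paring down to forward-secret suites might look like, here's a sketch using Python's OpenSSL bindings. The cipher string is one plausible choice, not a vetted policy, and the exact suite list you get back depends on the local OpenSSL build:

```python
import ssl

# Sketch: restrict the offered pre-TLS-1.3 suites to ephemeral (EC)DH
# key exchange with AEAD ciphers. Exact availability depends on the
# local OpenSSL build; treat this as illustration, not a final policy.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.set_ciphers('ECDHE+AESGCM:DHE+AESGCM')

# Every remaining pre-1.3 suite now uses an ephemeral key exchange,
# which is exactly what provides forward secrecy.
for suite in ctx.get_ciphers():
    if suite['protocol'] != 'TLSv1.3':  # TLS 1.3 suites are always PFS
        assert 'DHE' in suite['name']
```

Note that an RSA key exchange suite ("TLS_RSA_WITH_...") would fail that check -- those are the ones where a logged traffic archive plus a later server compromise decrypts everything.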
And finally, I realize how disruptive this is, but you need to change
all the URLs so that the hostname does not expose the language tag.
Server hostnames are cleartext even with HTTPS and SPDY (because they're
the subject of DNS lookups, and because they are sent both ways in the
clear as part of the TLS handshake); so even with ubiquitous encryption,
an eavesdropper can tell which language-specific encyclopedia is being
read, and that might be enough to finger someone.
My suggested bikeshed color would be to replace /wiki/ in the path with
the language tag. It is probably not necessary to do this for Commons, but
it *is* necessary for metawikis (knowing whether a given IP address ever
even looks at a metawiki may reveal something important).
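Here's a sketch of the rewrite I have in mind; the exact path layout is my guess at how it could work, not a settled design:

```python
from urllib.parse import urlsplit, urlunsplit

# Sketch: move the language tag out of the hostname and into the path,
# so DNS lookups and the TLS handshake only ever see 'wikipedia.org'.
# The path layout here is illustrative, not a settled design.

def hide_language_tag(url):
    parts = urlsplit(url)
    lang, _, bare_host = parts.netloc.partition('.')
    path = '/' + lang + parts.path.replace('/wiki/', '/', 1)
    return urlunsplit((parts.scheme, bare_host, path,
                       parts.query, parts.fragment))
```

With that scheme, every language edition becomes traffic to the same hostname, and the distinguishing bit moves inside the encrypted channel.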
Once *all of* those things have been done, we could start thinking about
traffic analysis resistance. I should be clear that this is an active
research field. Theoretically, yes, what you do is pad. In practice, we
don't know how much padding is required. I want to address two repeated
errors from earlier: The padding inherent in TLS block cipher modes is
"round up to the nearest multiple of 16 bytes", which has been shown to
be woefully inadequate. It is *theoretically* possible to make TLS
over-pad, up to a multiple of 256 bytes, but that is still inadequate,
and no off-the-shelf implementation bothers.
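To make the inadequacy concrete, the arithmetic is easy to check (the page sizes below are made up for illustration): two pages that differ by a few hundred bytes remain distinguishable even under the maximal 256-byte rounding.

```python
# Round a plaintext length up to the cipher's padding boundary.
def padded_len(n, block=16):
    return -(-n // block) * block  # ceiling division

# Two hypothetical pages a few hundred bytes apart stay distinguishable
# under both the default (16) and maximal (256) TLS padding granularity.
page_a, page_b = 41234, 41871
assert padded_len(page_a, 16) != padded_len(page_b, 16)
assert padded_len(page_a, 256) != padded_len(page_b, 256)
```

Hiding which page is being read would require padding on a far coarser scale than anything TLS itself offers -- and how coarse is exactly the open research question.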