(Please see the thread titled "Wikimedia's anti-surveillance plans: site
hardening" for who I am and some general context.)
Once Wikipedia is up to snuff with all the site-hardening I recommended
in the other thread, there remain two significant information leaks (and
probably others, but these two are gonna be a big project all by
themselves, so let's worry about them first). One is hostnames, and the
other is page(+resource) length.
Server hostnames are transmitted over the net in cleartext even when TLS
is in use: DNS queries are cleartext, and so is the Server Name
Indication (SNI) field of the TLS handshake, which exists precisely so
the server knows which certificate to send down. The current URL
structure of *.wiki[pm]edia.org exposes sensitive information in the
server hostname: for Wikipedia it's the language tag, for Wikimedia the
subproject.
Language seems like a serious exposure to me, potentially enough all by
itself to finger a specific IP address as associated with a specific
Wikipedia user handle. I realize how disruptive this would be, but I
think we need to consider changing the canonical Wikipedia URL format to
https://wikipedia.org/LANGUAGE/PAGENAME.
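Concretely, the change is a mechanical rewrite of the language subdomain
into the leading path component. A hypothetical sketch (function name
and details mine, not an implementation plan):

    from urllib.parse import urlsplit

    def canonicalize(url: str) -> str:
        """Map https://LANG.wikipedia.org/wiki/PAGE to the proposed
        https://wikipedia.org/LANG/PAGE form (illustrative only)."""
        parts = urlsplit(url)
        host = parts.hostname or ""
        if host.endswith(".wikipedia.org") and parts.path.startswith("/wiki/"):
            lang = host[: -len(".wikipedia.org")]   # e.g. "en", "de"
            page = parts.path[len("/wiki/"):]
            return f"https://wikipedia.org/{lang}/{page}"
        return url

    assert (canonicalize("https://en.wikipedia.org/wiki/Ada_Lovelace")
            == "https://wikipedia.org/en/Ada_Lovelace")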
For *.wikimedia.org it is less obvious what should be done. That domain
uses subdomain partitioning to control the same-origin policy (for
instance, upload.wikimedia.org needs to be a distinct hostname from
everything else, lest someone upload e.g. a malicious SVG that
exfiltrates your session cookies), so it cannot be altogether
consolidated. However, knowing (for instance) whether a particular user
is even *aware* of Commons or Meta may be enough to finger them, so we
need to think about *some* degree of consolidation.
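For context, the browser's same-origin check compares the (scheme, host,
port) triple, which is why shuffling hostnames around moves the security
boundary. A toy sketch of the comparison:

    from urllib.parse import urlsplit

    def origin(url: str) -> tuple:
        """The same-origin comparison key: (scheme, host, port)."""
        p = urlsplit(url)
        return (p.scheme, p.hostname,
                p.port or {"http": 80, "https": 443}[p.scheme])

    # Distinct hostnames are distinct origins: script in a malicious SVG
    # served from upload.wikimedia.org cannot touch cookies or DOM state
    # belonging to commons.wikimedia.org.
    assert (origin("https://upload.wikimedia.org/x.svg")
            != origin("https://commons.wikimedia.org/wiki/Main_Page"))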
---
Just how much information is exposed by page length (and how best to
mitigate it) is a live area of basic research. It happens to be *my*
area of basic research, and I would be interested in collaborating with
y'all on locking it down (it would make a spiffy case study for my
thesis :-), but I must emphasize that *we don't know whether it is
possible to prevent this attack*.
I recommend that everyone interested in this topic read these articles:
http://hal.archives-ouvertes.fr/docs/00/74/78/41/PDF/johnny2hotpet-finalcam…
discusses why Web browsing history is sensitive information in general.
http://kpdyer.com/publications/oakland2012.pdf and
http://www.freehaven.net/anonbib/cache/ccs2012-fingerprinting.pdf
demonstrate how page length can reveal page identity and debunk a number
of "easy" fixes; their reference lists are good portals to the
literature. Finally,
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf demonstrates a
related but perhaps even more insidious attack, whereby the eavesdropper
learns the *user identity* of someone on a social network by virtue of
the size of their profile photo.
This last article raises a critical point. To render Wikipedia genuinely
secure against traffic analysis, it is not sufficient for the
eavesdropper to be unable to identify *which pages* are being read or
edited. The eavesdropper may also be able to learn and make use of the
answers to questions such as:
* Given an IP address known to be communicating with WP/WM, whether
or not there is a logged-in user responsible for the traffic.
* Assuming it is known that a logged-in user is responsible for some
traffic, *which user it is* (User: handle) or whether the user has
any special privileges.
* State transitions between uncredentialed and logged-in (in either
direction).
* State transitions between reading and editing.
This is unlikely to be an exhaustive list. If we are serious about
defending against traffic analysis, one of the first things we should do
is have a bunch of experienced editors and developers sit down and work
out an exhaustive list of things we don't want to reveal. (I have only
ever dabbled in editing Wikipedia.)
Now, once this is pinned down, theoretically, yes, the cure is padding.
However, the padding inherent in TLS block cipher modes is *not*
adequate: it is normally strictly "round up to the nearest multiple of
16 bytes", which has been shown to leave pages readily identifiable.
One of the above papers talks about patching GnuTLS to pad randomly by
up to 256 bytes, but this too is probably insufficient.
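A back-of-the-envelope illustration (all numbers hypothetical) of why
16-byte rounding buys almost nothing:

    import random

    # Round hypothetical page sizes up to a multiple of 16 bytes, as a
    # TLS block cipher would, and see how many remain distinguishable.
    random.seed(0)
    sizes = random.sample(range(5_000, 2_000_000), 10_000)
    padded = [s + (-s % 16) for s in sizes]
    print(f"{len(sizes)} pages -> {len(set(padded))} distinct padded lengths")
    # The overwhelming majority of sizes stay unique after rounding: the
    # padding erases almost none of the identifying information.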
Random padding, in fact, is no good at all. The adversary can simply
average over many pageloads and extract the true length. What's actually
needed is to *bin* page (+resource) sizes such that any given load could
be a substantial number of different pages.
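To see why, here is a small simulation of that averaging attack,
assuming uniform random padding of up to 256 bytes per load (parameters
mine):

    import random

    random.seed(1)
    TRUE_LENGTH = 48_213                  # hypothetical page size
    # Uniform padding of 0-255 bytes adds 127.5 bytes on average, so the
    # mean of many observed loads, minus 127.5, converges on the truth.
    loads = [TRUE_LENGTH + random.randrange(256) for _ in range(1_000)]
    estimate = sum(loads) / len(loads) - 127.5
    print(round(estimate))                # recovers ~48213 within a few bytes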
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf also discusses how
such binning can be done in principle. The project - and I emphasize that it
would be a *project* - would be to arrange for MediaWiki (the software)
to do this binning automatically, such that the adversary cannot learn
anything useful either from individual traffic bursts or from a sequence
of such bursts, without bulking up overall data transfer sizes too much.
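Here is a minimal sketch of what such binning might look like (the
power-of-two boundaries are purely illustrative, not a recommendation):

    # Pad every response up to the next bin boundary, so any observed
    # size is consistent with every page in that bin. Exponentially
    # spaced bins bound the worst-case overhead by a constant factor.
    BINS = [2 ** k for k in range(10, 25)]   # 1 KiB .. 16 MiB, illustrative

    def binned_size(true_size: int) -> int:
        for boundary in BINS:
            if true_size <= boundary:
                return boundary              # pad up to the boundary
        raise ValueError("page larger than the largest bin")

    # Every page between 32 KiB and 64 KiB goes out as exactly 65,536
    # bytes on the wire, so length alone no longer identifies the page.
    assert binned_size(48_213) == binned_size(60_000) == 65_536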
Again, this is something I am interested in helping with, provided it is
understood that this is a live research question, that success is by no
means guaranteed, and that success may come at the expense of other
desirable qualities (such as not transmitting resources only needed for
the editor until someone actually clicks an edit link).
zw