(Please see the thread titled "Wikimedia's anti-surveillance plans: site hardening" for who I am and some general context.)
Once Wikipedia is up to snuff with all the site-hardening I recommended in the other thread, there remain two significant information leaks (and probably others, but these two are gonna be a big project all by themselves, so let's worry about them first). One is hostnames, and the other is page(+resource) length.
Server hostnames are transmitted over the net in cleartext even when TLS is in use (DNS operates in cleartext, and the cleartext portion of the TLS handshake includes the hostname, via the SNI extension, so the server knows which certificate to send down). The current URL structure of *.wiki[pm]edia.org exposes sensitive information in the server hostname: for Wikipedia it's the language tag, for Wikimedia the subproject. Language seems like a serious exposure to me, potentially enough all by itself to finger a specific IP address as associated with a specific Wikipedia user handle. I realize how disruptive this would be, but I think we need to consider changing the canonical Wikipedia URL format to https://wikipedia.org/LANGUAGE/PAGENAME.
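To make the proposal concrete, here is a rough sketch in Python of the mapping I have in mind. This is purely illustrative (the real rewrite would live in the web server / MediaWiki routing layer, and the function name and exact URL shapes are my own invention); the point is only that the language tag moves from the eavesdropper-visible hostname into the TLS-protected path:

    # Illustrative only; not how the production rewrite would be written.
    from urllib.parse import urlsplit, urlunsplit

    def canonicalize(url):
        """Map https://LANG.wikipedia.org/wiki/PAGE to
        https://wikipedia.org/LANG/PAGE (the format proposed above)."""
        scheme, host, path, query, frag = urlsplit(url)
        lang, _, rest = host.partition('.')
        if rest != 'wikipedia.org' or not path.startswith('/wiki/'):
            return url  # not a language-subdomain article URL; leave it alone
        page = path[len('/wiki/'):]
        return urlunsplit((scheme, 'wikipedia.org',
                           '/' + lang + '/' + page, query, frag))

    # canonicalize("https://en.wikipedia.org/wiki/Traffic_analysis")
    #   -> "https://wikipedia.org/en/Traffic_analysis"

Old-style URLs would presumably have to keep working as redirects for a long transition period.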
For *.wikimedia.org it is less obvious what should be done. That domain makes use of subdomain partitioning to control the same-origin policy (for instance, upload.wikimedia.org needs to be a distinct hostname from everything else, lest someone upload e.g. a malicious SVG that exfiltrates your session cookies), so it cannot be altogether consolidated. However, knowing (for instance) whether a particular user is even *aware* of Commons or Meta may be enough to finger them, so we need to think about *some* degree of consolidation.
---
Just how much information is exposed by page length (and how to best mitigate it) is a live area of basic research. It happens to be *my* area of basic research, and I would be interested in collaborating with y'all on locking it down (it would make a spiffy case study for my thesis :-) but I must emphasize that *we don't know if it is possible to prevent this attack*.
I recommend that everyone interested in this topic read these articles:
* http://hal.archives-ouvertes.fr/docs/00/74/78/41/PDF/johnny2hotpet-finalcam.... discusses why Web browsing history is sensitive information in general.
* http://kpdyer.com/publications/oakland2012.pdf and http://www.freehaven.net/anonbib/cache/ccs2012-fingerprinting.pdf demonstrate how page length can reveal page identity and debunk a number of "easy" fixes; their reference lists are good portals into the rest of the literature.
* http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf demonstrates a related but perhaps even more insidious attack, whereby the eavesdropper learns the *user identity* of someone on a social network from the size of their profile photo.
This last article raises a critical point. To render Wikipedia genuinely secure against traffic analysis, it is not sufficient for the eavesdropper to be unable to identify *which pages* are being read or edited. The eavesdropper may also be able to learn and make use of the answers to questions such as:
* Given an IP address known to be communicating with WP/WM, whether or not there is a logged-in user responsible for the traffic.
* Assuming it is known that a logged-in user is responsible for some traffic, *which user it is* (User: handle) or whether the user has any special privileges.
* State transitions between uncredentialed and logged-in (in either direction).
* State transitions between reading and editing.
This is unlikely to be an exhaustive list. If we are serious about defending against traffic analysis, one of the first things we should do is have a bunch of experienced editors and developers sit down and work out as complete a list as possible of the things we don't want to reveal. (I have only ever dabbled in editing Wikipedia.)
Now, once this is pinned down, theoretically, yes, the cure is padding. However, the padding inherent in TLS block cipher modes is *not* enough: it normally just rounds up to the nearest multiple of 16 bytes, which has been shown to be completely inadequate. One of the above papers talks about patching GnuTLS to pad randomly by up to 256 bytes, but this too is probably insufficient.
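For a sense of how little 16-byte rounding hides, here is a toy calculation (Python, with made-up page sizes, ignoring MAC and padding-length bytes, which only shift the boundary): only pages whose true sizes land in the same 16-byte bucket become indistinguishable, a tiny fraction of a multi-million-page corpus.

    # Toy illustration; all sizes are invented.
    def tls_block_pad(n, block=16):
        """Approximate on-the-wire size after rounding up to the block size."""
        return ((n + block - 1) // block) * block

    for name, size in [('Page_A', 48213),   # hypothetical sizes in bytes
                       ('Page_B', 48219),   # same 16-byte bucket as Page_A
                       ('Page_C', 48302)]:  # different bucket: distinguishable
        print(name, size, '->', tls_block_pad(size))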
Random padding, in fact, is no good at all. The adversary can simply average over many pageloads and extract the true length. What's actually needed is to *bin* page (+resource) sizes such that any given load could plausibly be any of a substantial number of different pages. http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf also discusses how this can be done in principle. The project - and I emphasize that it would be a *project* - would be to arrange for MediaWiki (the software) to do this binning automatically, such that the adversary cannot learn anything useful either from individual traffic bursts or from a sequence of such bursts, without bulking up overall data transfer sizes too much.
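To make the contrast concrete, here is a toy sketch of why averaging defeats random padding while binning does not. All the sizes, and the power-of-two bin edges, are my own invented example, not necessarily the scheme that paper recommends:

    # Toy sketch; all numbers invented.
    import random

    TRUE_SIZE = 48213                     # hypothetical true page size

    # 1. Random padding: the sample mean converges on the true size plus
    #    a known offset (here 128, the mean of the padding), so an
    #    adversary who sees many loads just averages.
    obs = [TRUE_SIZE + random.randint(0, 256) for _ in range(10000)]
    print(sum(obs) / len(obs) - 128)      # ~= TRUE_SIZE

    # 2. Binning: pad every page up to the next bin edge.  No matter how
    #    many loads the adversary observes, this page is indistinguishable
    #    from every other page that falls into the same bin.
    BINS = [2**k for k in range(10, 25)]  # 1 KiB, 2 KiB, ... 16 MiB

    def bin_pad(n):
        return next(b for b in BINS if b >= n)

    print(bin_pad(TRUE_SIZE))             # 65536, same as many other pages

The obvious tension is the one already noted: coarser bins put more pages in each bin but waste more bytes, and finding the right tradeoff for a corpus the size of Wikipedia is exactly the open part of the problem.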
Again, this is something I am interested in helping with, provided it is understood that this is a live research question, that success is by no means guaranteed, and that success may come at the expense of other desirable qualities (such as not transmitting resources only needed for the editor until someone actually clicks an edit link).
zw