(Please see the thread titled "Wikimedia's anti-surveillance plans: site
hardening" for who I am and some general context.)
Once Wikipedia is up to snuff with all the site-hardening I recommended
in the other thread, there remain two significant information leaks (and
probably others, but these two are gonna be a big project all by
themselves, so let's worry about them first). One is hostnames, and the
other is page(+resource) length.
Server hostnames are transmitted over the net in cleartext even when TLS
is in use: DNS queries are cleartext, and so is the Server Name
Indication (SNI) field of the TLS handshake, which exists precisely so
the server knows which certificate to send down. The current URL
structure of *.wiki[pm]edia.org exposes sensitive information in the
server hostname: for Wikipedia it's the language tag, for Wikimedia the
subproject.
Language seems like a serious exposure to me, potentially enough all by
itself to finger a specific IP address as associated with a specific
Wikipedia user handle. I realize how disruptive this would be, but I
think we need to consider changing the canonical Wikipedia URL format to
https://wikipedia.org/LANGUAGE/PAGENAME.
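Concretely, the change is a mechanical rewrite of the language subdomain
into the leading path component. A hypothetical sketch (function name
and details mine, not an implementation plan):

    from urllib.parse import urlsplit

    def canonicalize(url: str) -> str:
        """Map https://LANG.wikipedia.org/wiki/PAGE to the proposed
        https://wikipedia.org/LANG/PAGE form (illustrative only)."""
        parts = urlsplit(url)
        host = parts.hostname or ""
        if host.endswith(".wikipedia.org") and parts.path.startswith("/wiki/"):
            lang = host[: -len(".wikipedia.org")]   # e.g. "en", "de"
            page = parts.path[len("/wiki/"):]
            return f"https://wikipedia.org/{lang}/{page}"
        return url

    assert (canonicalize("https://en.wikipedia.org/wiki/Ada_Lovelace")
            == "https://wikipedia.org/en/Ada_Lovelace")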
For *.wikimedia.org it is less obvious what should be done. That domain
uses subdomain partitioning to control the same-origin policy (for
instance, upload.wikimedia.org needs to be a distinct hostname from
everything else, lest someone upload e.g. a malicious SVG that
exfiltrates your session cookies), so it cannot be altogether
consolidated. However, knowing (for instance) whether a particular user
is even *aware* of Commons or Meta may be enough to finger them, so we
need to think about *some* degree of consolidation.
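For context, the browser's same-origin check compares the (scheme, host,
port) triple, which is why shuffling hostnames around moves the security
boundary. A toy sketch of the comparison:

    from urllib.parse import urlsplit

    def origin(url: str) -> tuple:
        """The same-origin comparison key: (scheme, host, port)."""
        p = urlsplit(url)
        return (p.scheme, p.hostname,
                p.port or {"http": 80, "https": 443}[p.scheme])

    # Distinct hostnames are distinct origins: script in a malicious SVG
    # served from upload.wikimedia.org cannot touch cookies or DOM state
    # belonging to commons.wikimedia.org.
    assert (origin("https://upload.wikimedia.org/x.svg")
            != origin("https://commons.wikimedia.org/wiki/Main_Page"))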
---
Just how much information is exposed by page length (and how best to
mitigate it) is a live area of basic research. It happens to be *my*
area of basic research, and I would be interested in collaborating with
y'all on locking it down (it would make a spiffy case study for my
thesis :-), but I must emphasize that *we don't know whether it is
possible to prevent this attack*.
I recommend that everyone interested in this topic read these articles:
http://hal.archives-ouvertes.fr/docs/00/74/78/41/PDF/johnny2hotpet-finalcam…
discusses why Web browsing history is sensitive information in general.
http://kpdyer.com/publications/oakland2012.pdf and
http://www.freehaven.net/anonbib/cache/ccs2012-fingerprinting.pdf
demonstrate how page length can reveal page identity and debunk a number
of "easy" fixes; their reference lists are good portals to the
literature. Finally,
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf demonstrates a
related but perhaps even more insidious attack, whereby the eavesdropper
learns the *user identity* of someone on a social network by virtue of
the size of their profile photo.
This last article raises a critical point. To render Wikipedia genuinely
secure against traffic analysis, it is not sufficient for the
eavesdropper to be unable to identify *which pages* are being read or
edited. The eavesdropper may also be able to learn and make use of the
answers to questions such as:
* Given an IP address known to be communicating with WP/WM, whether
or not there is a logged-in user responsible for the traffic.
* Assuming it is known that a logged-in user is responsible for some
traffic, *which user it is* (User: handle) or whether the user has
any special privileges.
* State transitions between uncredentialed and logged-in (in either
direction).
* State transitions between reading and editing.
This is unlikely to be an exhaustive list. If we are serious about
defending against traffic analysis, one of the first things we should do
is have a bunch of experienced editors and developers sit down and work
out an exhaustive list of things we don't want to reveal. (I have only
ever dabbled in editing Wikipedia.)
Now, once this is pinned down, theoretically, yes, the cure is padding.
However, the padding inherent in TLS block cipher modes is *not*
adequate: it is normally strictly "round up to the nearest multiple of
16 bytes", which has been shown to leave pages readily identifiable.
One of the above papers talks about patching GnuTLS to pad randomly by
up to 256 bytes, but this too is probably insufficient.
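A back-of-the-envelope illustration (all numbers hypothetical) of why
16-byte rounding buys almost nothing:

    import random

    # Round hypothetical page sizes up to a multiple of 16 bytes, as a
    # TLS block cipher would, and see how many remain distinguishable.
    random.seed(0)
    sizes = random.sample(range(5_000, 2_000_000), 10_000)
    padded = [s + (-s % 16) for s in sizes]
    print(f"{len(sizes)} pages -> {len(set(padded))} distinct padded lengths")
    # The overwhelming majority of sizes stay unique after rounding: the
    # padding erases almost none of the identifying information.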
Random padding, in fact, is no good at all. The adversary can simply
average over many pageloads and extract the true length. What's actually
needed is to *bin* page (+resource) sizes such that any given load could
be a substantial number of different pages.
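To see why, here is a small simulation of that averaging attack,
assuming uniform random padding of up to 256 bytes per load (parameters
mine):

    import random

    random.seed(1)
    TRUE_LENGTH = 48_213                  # hypothetical page size
    # Uniform padding of 0-255 bytes adds 127.5 bytes on average, so the
    # mean of many observed loads, minus 127.5, converges on the truth.
    loads = [TRUE_LENGTH + random.randrange(256) for _ in range(1_000)]
    estimate = sum(loads) / len(loads) - 127.5
    print(round(estimate))                # recovers ~48213 within a few bytes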
http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf also discusses how
such binning can be done in principle. The project - and I emphasize that it
would be a *project* - would be to arrange for MediaWiki (the software)
to do this binning automatically, such that the adversary cannot learn
anything useful either from individual traffic bursts or from a sequence
of such bursts, without bulking up overall data transfer sizes too much.
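Here is a minimal sketch of what such binning might look like (the
power-of-two boundaries are purely illustrative, not a recommendation):

    # Pad every response up to the next bin boundary, so any observed
    # size is consistent with every page in that bin. Exponentially
    # spaced bins bound the worst-case overhead by a constant factor.
    BINS = [2 ** k for k in range(10, 25)]   # 1 KiB .. 16 MiB, illustrative

    def binned_size(true_size: int) -> int:
        for boundary in BINS:
            if true_size <= boundary:
                return boundary              # pad up to the boundary
        raise ValueError("page larger than the largest bin")

    # Every page between 32 KiB and 64 KiB goes out as exactly 65,536
    # bytes on the wire, so length alone no longer identifies the page.
    assert binned_size(48_213) == binned_size(60_000) == 65_536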
Again, this is something I am interested in helping with, provided it is
understood that this is a live research question, that success is by no
means guaranteed, and that success may come at the expense of other
desirable qualities (such as not transmitting resources only needed for
the editor until someone actually clicks an edit link).
zw