(Please see the thread titled "Wikimedia's anti-surveillance plans: site hardening" for who I am and some general context.)
Once Wikipedia is up to snuff with all the site-hardening I recommended in the other thread, there remain two significant information leaks (and probably others, but these two are gonna be a big project all by themselves, so let's worry about them first). One is hostnames, and the other is page(+resource) length.
Server hostnames are transmitted over the net in cleartext even when TLS is in use (because DNS operates in cleartext, and because the cleartext portion of the TLS handshake includes the hostname, so the server knows which certificate to send down). The current URL structure of *.wiki[pm]edia.org exposes sensitive information in the server hostname: for Wikipedia it's the language tag, for Wikimedia the subproject. Language seems like a serious exposure to me, potentially enough all by itself to finger a specific IP address as associated with a specific Wikipedia user handle. I realize how disruptive this would be, but I think we need to consider changing the canonical Wikipedia URL format to https://wikipedia.org/LANGUAGE/PAGENAME.
For *.wikimedia.org it is less obvious what should be done. That domain makes use of subdomain partitioning to control the same-origin policy (for instance, upload.wikimedia.org needs to be a distinct hostname from everything else, lest someone upload e.g. a malicious SVG that exfiltrates your session cookies) so it cannot be altogether consolidated. However, knowing (for instance) whether a particular user is even *aware* of Commons or Meta may be enough to finger them, so we need to think about *some* degree of consolidation.
---
Just how much information is exposed by page length (and how to best mitigate it) is a live area of basic research. It happens to be *my* area of basic research, and I would be interested in collaborating with y'all on locking it down (it would make a spiffy case study for my thesis :-) but I must emphasize that *we don't know if it is possible to prevent this attack*.
I recommend that everyone interested in this topic read these articles:

* http://hal.archives-ouvertes.fr/docs/00/74/78/41/PDF/johnny2hotpet-finalcam.... discusses why Web browsing history is sensitive information in general.
* http://kpdyer.com/publications/oakland2012.pdf and http://www.freehaven.net/anonbib/cache/ccs2012-fingerprinting.pdf demonstrate how page length can reveal page identity and debunk a number of "easy" fixes; their reference lists are good portals to the literature.
* http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf demonstrates a related but perhaps even more insidious attack, whereby the eavesdropper learns the *user identity* of someone on a social network by virtue of the size of their profile photo.
This last article raises a critical point. To render Wikipedia genuinely secure against traffic analysis, it is not sufficient for the eavesdropper to be unable to identify *which pages* are being read or edited. The eavesdropper may also be able to learn and make use of the answers to questions such as:
* Given an IP address known to be communicating with WP/WM, whether or not there is a logged-in user responsible for the traffic.
* Assuming it is known that a logged-in user is responsible for some traffic, *which user it is* (User: handle) or whether the user has any special privileges.
* State transitions between uncredentialed and logged-in (in either direction).
* State transitions between reading and editing.
This is unlikely to be an exhaustive list. If we are serious about defending against traffic analysis, one of the first things we should do is have a bunch of experienced editors and developers sit down and work out an exhaustive list of things we don't want to reveal. (I have only ever dabbled in editing Wikipedia.)
Now, once this is pinned down, theoretically, yes, the cure is padding. However, the padding inherent in TLS block cipher modes is *not* enough; it's normally strictly "round up to the nearest multiple of 16 bytes", which has been shown to be completely inadequate. One of the above papers talks about patching GnuTLS to pad randomly by up to 256 bytes, but this too is probably insufficient.
Random padding, in fact, is no good at all. The adversary can simply average over many pageloads and extract the true length. What's actually needed is to *bin* page (+resource) sizes such that any given load could be a substantial number of different pages. http://hal.inria.fr/docs/00/73/29/55/PDF/RR-8067.pdf also discusses how this can be done in principle. The project - and I emphasize that it would be a *project* - would be to arrange for MediaWiki (the software) to do this binning automatically, such that the adversary cannot learn anything useful either from individual traffic bursts or from a sequence of such bursts, without bulking up overall data transfer sizes too much.
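To make "binning" concrete, here is a toy sketch (Python; the bin parameters are pulled out of thin air, and choosing real ones for Wikipedia's actual size distribution is precisely the research question):

    # Toy illustration of size binning: pad every response up to the next
    # bin boundary, so that many distinct pages share each observable size.
    # MIN_BIN and GROWTH are invented for this example, not recommendations.
    import math

    MIN_BIN = 4096     # never let anything look smaller than 4 KiB
    GROWTH  = 1.25     # each bin is 25% larger than the previous one

    def binned_size(true_size: int) -> int:
        """Smallest bin boundary >= true_size."""
        if true_size <= MIN_BIN:
            return MIN_BIN
        n = math.ceil(math.log(true_size / MIN_BIN, GROWTH))
        return math.ceil(MIN_BIN * GROWTH ** n)

    def padding_needed(true_size: int) -> int:
        return binned_size(true_size) - true_size

    if __name__ == "__main__":
        # e.g. 38000 and 38001 both come out as 38147 -- indistinguishable
        for size in (1200, 38000, 38001, 250000):
            print(size, "->", binned_size(size))

With a scheme like this the worst-case bandwidth overhead per response is bounded by the growth factor (here 25%, plus the floor for tiny responses); that factor is the knob trading transfer bloat against how many pages share each bin. Whether any such setting actually defeats the classifiers in the papers above is exactly what would have to be measured.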
Again, this is something I am interested in helping with, provided it is understood that this is a live research question, that success is by no means guaranteed, and that success may come at the expense of other desirable qualities (such as not transmitting resources only needed for the editor until someone actually clicks an edit link).
zw
On Fri, Aug 16, 2013 at 8:04 PM, Zack Weinberg zackw@cmu.edu wrote:
Wikipedia user handle. I realize how disruptive this would be, but I think we need to consider changing the canonical Wikipedia URL format to https://wikipedia.org/LANGUAGE/PAGENAME.
Note that LANGUAGE.wikipedia.org/VARIANT/PAGENAME is already in use for wikis which use the language variant conversion code, such as zhwiki. Usually LANGUAGE is a prefix of VARIANT, for example zh-hans, zh-hant, en-us, en-gb, sr, sr-ec.
If we wanted to approach this goal, we could start by creating a proxy service at https://secure.wikipedia.org/LANGUAGE-VARIANT/PAGENAME that internally proxied pages from https://LANGUAGE.wikipedia.org/LANGUAGE-VARIANT/PAGENAME. That would allow some low-risk, be-bold exploration of the different implications.
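To sketch what that proxy's rewrite could look like (purely illustrative Python, assuming the LANGUAGE-is-a-prefix-of-VARIANT convention above; obviously not how it would really be deployed):

    # Hypothetical path-to-hostname mapping for a secure.wikipedia.org proxy.
    # Assumes LANGUAGE is the part of LANGUAGE-VARIANT before the first "-"
    # (zh-hant -> zh, sr-ec -> sr, plain "sr" -> sr).  Illustration only.
    from urllib.parse import quote

    def backend_url(request_path: str) -> str:
        """Map /LANGUAGE-VARIANT/PAGENAME (as seen by the proxy) to the
        internal https://LANGUAGE.wikipedia.org/... URL to fetch."""
        variant, _, pagename = request_path.lstrip("/").partition("/")
        language = variant.split("-", 1)[0]
        return "https://%s.wikipedia.org/%s/%s" % (
            language, variant, quote(pagename, safe="/:"))

    # backend_url("/zh-hant/Wikipedia:Sandbox")
    #   -> "https://zh.wikipedia.org/zh-hant/Wikipedia:Sandbox"
    # backend_url("/sr-ec/Main_Page")
    #   -> "https://sr.wikipedia.org/sr-ec/Main_Page"

The interesting part wouldn't be the rewrite itself but everything around it: cookies, caching, and making sure the proxy doesn't leak the same information some other way.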
This last article raises a critical point. To render Wikipedia genuinely secure against traffic analysis
Whenever someone seems to veer into discussion of absolute security, I get nervous. It would be best to begin by asking "how can we make attacks more expensive?"
Given the contents of the most recent NSA document leaks, it seems worthwhile to also attempt to confound the "are we at least 51% certain that this user is not an American" question. Combining wikis does seem like a worthwhile step here. I wonder if any arbitrary user of zhwiki (for example) would automatically be assumed to have a >51% chance of being non-American.
Random padding, in fact, is no good at all. The adversary can simply
average over many pageloads and extract the true length.
Again, "no good at all" slides into this "absolute security" fallacy. *How much more difficult* does padding make things? *How many* more pageloads? The adversaries with infinite resources can also legally compel the sysop to compromise the server. But can we improve the situation for medium-sized state actors, or raise the bar so that only targeted users can be compromised (instead of passively collecting information on all users)?
As a start on constructing a better threat model, let me offer two scenarios:
a) NSA passive collection of all traffic to/from wikipedia (XKEYSCORE). It would be nice to frustrate this so that (as a start) only traffic from targeted users could be effectively collected -- for example, by requiring an active MIM attack instead of a passive tap.
b) Great Firewall monitoring of specific pages (Tiananmen Square, Falun Gong). Can we better protect the identities of readers of these pages? Can we protect the identities of editors? Can we frustrate attempts to block specific pages?
Real world issues should also be taken into account. Methods that prevent the Great Firewall from blocking specific pages might provoke a site-wide block. Efforts to force utilization of the latest browsers (which support some new protocol) might disenfranchise mobile users or users for whom poverty and resource limitations are a bigger threat than coercive government. Etc... --scott
On 17/08/13 10:04, Zack Weinberg wrote:
What's actually needed is to *bin* page (+resource) sizes such that any given load could be a substantial number of different pages.
That would certainly help for some sites, but I don't think it would make much difference for Wikipedia. Even if you used power-of-two bins, you would still be leaking tens of bits per page view due to the sizes and request order of the images on each page.
It would mean that instead of just crawling the HTML, the attacker would also have to send a HEAD request for each image. Maybe it would add 50% to the attacker's software development time, but once it was done for one site, it would work against any similar site.
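To put a toy number on "tens of bits" (illustrative arithmetic, not a measurement of our actual pages):

    # Toy upper bound on the information in per-image sizes, even when every
    # object is padded to a power-of-two bin.  Numbers are invented.
    import math

    def fingerprint_bits(num_images: int, num_bins: int) -> float:
        """Each image contributes up to log2(num_bins) bits, and the request
        order is assumed observable to someone tapping the link."""
        return num_images * math.log2(num_bins)

    # A page with 20 images, each falling into one of the 12 power-of-two
    # bins between 1 KiB and 2 MiB:
    print(round(fingerprint_bits(20, 12), 1))   # ~71.7 bits

That's an upper bound rather than the true entropy (real image sizes are far from uniform over the bins), but it shows why binning the HTML alone doesn't buy much.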
-- Tim Starling
On Sat, Aug 17, 2013 at 12:04 AM, Zack Weinberg zackw@cmu.edu wrote:
- State transitions between reading and editing.
Reads on our projects (whether logged in or out) involve very little data coming from the client and a lot being sent back to it. When a page is saved or previewed (or even during an edit, with calls to the Parsoid web service?), or a file is uploaded, there is suddenly much more substantial traffic going from the client up to the projects.
This could be mitigated in part by chunking larger transmissions and sending them over time, which we already do for some users with the upload wizard (i.e. file uploads; we could expand that to cover more users). Those chunked transmissions could then be spread out over time and mixed in with transmissions of garbage when there is no pending chunk to send (rough sketch of what I mean at the end of this message).
That may be OK from a UX perspective for file uploads (run in the background and let the user do other stuff while the upload runs) but I don't think people will want to wait for their edits (or previews!) to go through.
Also, we'd probably need an opt-out option for sending the garbage when idle and we'd have to research the potential impact on bandwidth bills/quotas. And maybe also impact on battery life??
(it couldn't be opt-in because the fact that you had opted in would itself be an indicator that you might be an editor.)
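Here's the rough sketch I mentioned above, of a constant-rate uplink sender (hypothetical; chunk size, interval and the actual POST plumbing are all placeholders):

    # Hypothetical constant-rate uplink scheduler: every INTERVAL seconds we
    # send exactly one CHUNK_SIZE-byte cell -- a padded chunk of a pending
    # edit/upload if one is queued, otherwise random cover traffic.  An
    # observer of the link sees the same pattern either way.
    import os
    import queue
    import threading
    import time

    CHUNK_SIZE = 16 * 1024   # bytes per cell (made-up number)
    INTERVAL = 0.5           # seconds between cells (made-up number)

    pending = queue.Queue()  # chunks of real data waiting to go out

    def enqueue_payload(data: bytes) -> None:
        """Split a real edit/upload into fixed-size, zero-padded chunks."""
        for i in range(0, len(data), CHUNK_SIZE):
            pending.put(data[i:i + CHUNK_SIZE].ljust(CHUNK_SIZE, b"\0"))

    def send_cell(cell: bytes) -> None:
        pass  # stand-in for the real HTTPS POST to the API

    def sender_loop(stop: threading.Event) -> None:
        while not stop.is_set():
            try:
                cell = pending.get_nowait()    # real data, already padded
            except queue.Empty:
                cell = os.urandom(CHUNK_SIZE)  # garbage / cover traffic
            send_cell(cell)
            time.sleep(INTERVAL)

The bandwidth/battery questions above then become the CHUNK_SIZE x INTERVAL trade-off: the steady rate is what costs you, whether or not you're actually editing.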
-Jeremy