Hi Zack,
Thanks for bringing this up again; this is a very useful discussion to have.
On Thu, Jun 05, 2014 at 12:45:11PM -0400, Zack Weinberg wrote:
> - what page is the target reading?
> - what _sequence of pages_ is the target reading? (This is actually
>   easier, assuming the attacker knows the internal link graph.)
The former should be pretty easy too, due to the ancillary requests that you already briefly mentioned.
Because of our domain sharding strategy, which places media under a separate domain (upload.wikimedia.org), an adversary would know, for a given page, (1) the size of the encrypted text response and (2) the count and sizes of the responses to media that were requested immediately after the main text response. This combination would create a pretty unique fingerprint for a lot of pages, especially well-curated pages that have a fair amount of media embedded in them.
Combine this with the fact that we provide XML dumps of our content and images, plus a live feed of changes in real time, and it should be easy enough for a couple of researchers (let alone state agencies with unlimited resources) to devise an algorithm that computes these fingerprints with great accuracy and exposes (at least) reads.
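
To make this concrete, here's a very rough sketch (illustrative Python; the names and the tolerance value are invented) of what such a passive observer could do: precompute a fingerprint per article from the dumps or a replica, then match the sizes seen on the wire against that index:

from typing import Dict, List, Tuple

Fingerprint = Tuple[int, Tuple[int, ...]]

def fingerprint(text_size: int, media_sizes: List[int]) -> Fingerprint:
    # (size of the main text response, sorted sizes of the media
    #  responses fetched from upload.wikimedia.org right after it)
    return (text_size, tuple(sorted(media_sizes)))

def build_index(pages: Dict[str, Tuple[int, List[int]]]) -> Dict[str, Fingerprint]:
    # Precompute fingerprints for every article, e.g. by crawling a
    # replica built from the public dumps and keeping it current via
    # the recent changes feed.
    return {title: fingerprint(t, m) for title, (t, m) in pages.items()}

def match(observed: Fingerprint, index: Dict[str, Fingerprint],
          tolerance: int = 64) -> List[str]:
    # Candidate titles whose sizes are within `tolerance` bytes per
    # response of what was seen on the wire (a crude allowance for
    # TLS record/header overhead; the 64 is made up).
    obs_text, obs_media = observed
    candidates = []
    for title, (text, media) in index.items():
        if len(media) != len(obs_media) or abs(text - obs_text) > tolerance:
            continue
        if all(abs(a - b) <= tolerance for a, b in zip(media, obs_media)):
            candidates.append(title)
    return candidates

For a well-curated article with several images I'd expect that candidate list to collapse to a single title more often than not.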
> What I would like to do, in the short term, is perform a large-scale crawl of one or more of the encyclopedias and measure what the above eavesdropper would observe. I would do this over regular HTTPS, from a documented IP address, both as a logged-in user and an anonymous user.
I doubt you can create enough traffic to make a difference, so yes, with my operations hat on, sure, you can go ahead. Note that all of our software, production stack/config management and dumps of our content are publicly available and free (as in speech) to use, so you or anyone else could even create a replica environment and do this kind of analysis without us ever noticing.
> With that data in hand, the next phase would be to develop some sort of algorithm for automatically padding HTTP responses to maximize eavesdropper confusion while minimizing overhead. I don't yet know exactly how this would work. I imagine that it would be based on clustering the database into sets of pages with similar length but radically different contents.
I don't think it'd make sense to involve the database in this at all. It'd make much more sense to postprocess the content (still within MediaWiki, most likely) and pad it to fit into buckets of predefined sizes. You'd have to take care of padding images as well, as the combination of count/size alone leaks too many bits of information.
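
Something along these lines, purely as an illustration (the bucket boundaries and the padding trick are made up; the real thing would presumably live in MediaWiki's output path rather than in Python):

# Bucket boundaries are invented, just to illustrate the idea.
BUCKETS = [8 << 10, 16 << 10, 32 << 10, 64 << 10, 128 << 10,
           256 << 10, 512 << 10, 1 << 20]

def padded_size(size: int) -> int:
    # Smallest predefined bucket that fits the response; anything
    # larger rounds up to the next multiple of the largest bucket.
    for bucket in BUCKETS:
        if size <= bucket:
            return bucket
    top = BUCKETS[-1]
    return -(-size // top) * top

def pad_text(body: bytes) -> bytes:
    # For text/html the filler can be something the client ignores,
    # here a trailing HTML comment; images would need format-specific
    # filler or transport-level padding instead.
    filler = padded_size(len(body) + 7) - len(body)   # 7 = len("<!--" + "-->")
    return body + b"<!--" + b" " * (filler - 7) + b"-->"

The interesting measurement then becomes the overhead this adds for our actual size distribution versus how many distinct pages end up sharing each bucket (and the same trade-off would have to be worked out for images, where the count leaks bits too).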
However, as others have mentioned already, this kind of attack is partially addressed by the introduction of SPDY / HTTP/2.0, which is on our roadmap. A full production deployment, including undoing optimizations such as domain sharding (and SSL+SPDY by default, for everyone), is many months away, though, which does make me wonder whether it makes much sense to spend time focusing on plain HTTPS attacks right now.
Regards,
Faidon