I'd like to restart the conversation about hardening Wikipedia (or possibly Wikimedia in general) against traffic analysis. I brought this up last November, I think (give or take a month), but it got lost in a larger discussion about HTTPS.
For background, the type of attack that it would be nice to be able to prevent is described in this paper: http://sysseclab.informatics.indiana.edu/projects/sidebuster/sidebuster-fina... Someone is eavesdropping on an encrypted connection to LANG.wikipedia.org. (It's not possible to prevent the attacker from learning the DNS name and therefore the language the target reads, short of Tor or similar. It's also not possible to prevent them from noticing accesses to ancillary servers, e.g. Commons for media.) The attacker's goal is to figure out things like:
* what page is the target reading?
* what _sequence of pages_ is the target reading? (This is actually easier, assuming the attacker knows the internal link graph.)
* is the target a logged-in user, and if so, which user?
* did the target just edit a page, and if so, which page?
* (... y'all are probably better at thinking up these hypotheticals than me ...)
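To make the first of those concrete, here is a toy Python sketch of the size-fingerprinting step: the eavesdropper crawls the wiki themselves to build a table of per-page transfer sizes, then matches each observed response size against it. The titles and sizes below are invented, and a real attack would also use subresource counts, timing, and the link graph; the point is just that pages of similar length are confusable and pages of very different length are not.

    # Toy size-fingerprinting illustration. Assumptions: the attacker already
    # has a per-page transfer-size table from their own crawl, and the victim
    # fetches one page per observation. Sizes and titles are made up.
    from typing import Dict, List

    def candidate_pages(observed_size: int, size_table: Dict[str, int],
                        tolerance: int = 64) -> List[str]:
        """Pages whose known size is within `tolerance` bytes of the observation."""
        return [title for title, size in size_table.items()
                if abs(size - observed_size) <= tolerance]

    if __name__ == "__main__":
        size_table = {"Alan_Turing": 154321,
                      "Enigma_machine": 98765,
                      "Bletchley_Park": 154389}
        # -> ['Alan_Turing', 'Bletchley_Park']: the two ~154 kB pages are
        # indistinguishable by size alone, the ~99 kB page is ruled out.
        print(candidate_pages(154350, size_table))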
Wikipedia is different from a tax-preparation website (the case study in the above paper) in that all of the content is public, and edit actions are also public. The attacker can therefore correlate their eavesdropping data with observations of Special:RecentChanges and the like. This may mean it is impossible to prevent the attacker from detecting edits. I think it's worth the experiment, though.
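As a strawman for the edit-detection point: once the attacker has the timestamp of an upload-shaped burst on the victim's connection, they only have to ask the public recent-changes feed what was saved around then. Something like the sketch below, using the standard MediaWiki query API; the 30-second window is arbitrary, and on a busy wiki many unrelated edits will fall inside it, which is exactly the correlation problem I'd want to measure.

    # Sketch: correlate an observed "upload burst" time with public edits.
    # Assumptions: standard MediaWiki API (action=query, list=recentchanges),
    # the 'requests' library, and a tz-aware burst timestamp.
    from datetime import datetime, timedelta, timezone
    import requests

    API = "https://en.wikipedia.org/w/api.php"   # en is just an example

    def edits_near(burst_time: datetime, window_s: int = 30):
        """Recent changes whose server timestamp is within window_s of burst_time."""
        params = {
            "action": "query", "list": "recentchanges", "format": "json",
            "rcprop": "title|timestamp|user", "rclimit": 500,
        }
        rc = requests.get(API, params=params).json()["query"]["recentchanges"]
        window = timedelta(seconds=window_s)
        return [c for c in rc
                if abs(datetime.fromisoformat(c["timestamp"].replace("Z", "+00:00"))
                       - burst_time) <= window]

    if __name__ == "__main__":
        # e.g. a burst observed right now:
        print(edits_near(datetime.now(timezone.utc)))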
What I would like to do, in the short term, is perform a large-scale crawl of one or more of the encyclopedias and measure what the above eavesdropper would observe. I would do this over regular HTTPS, from a documented IP address, both as a logged-in user and as an anonymous user. That would capture only the reading experience; I would also like to work with prolific editors to measure the traffic patterns that editing generates. (Bot edits go via the API, as I understand it, and so are not reflective of "naturalistic" editing by human users.)
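For the crawl itself I am imagining nothing fancier than the following sketch. Assumptions: the 'requests' library, a flat file of article titles, en.wikipedia.org as a stand-in for LANG.wikipedia.org, and that compressed body size plus response-header size is a reasonable first-order proxy for what the eavesdropper sees (TLS record framing and padding are not captured here).

    #!/usr/bin/env python3
    # Rough measurement-crawl sketch: fetch each title once, record an
    # approximate on-the-wire size and elapsed time as CSV on stdout.
    import csv, sys, time
    from urllib.parse import quote
    import requests

    BASE = "https://en.wikipedia.org/wiki/"   # en as a stand-in for LANG

    def wire_size_estimate(resp):
        # Raw (still-compressed) body plus response headers; ignores TLS framing.
        body = len(resp.raw.read(decode_content=False))
        headers = sum(len(k) + len(v) + 4 for k, v in resp.headers.items())
        return body + headers

    def main(title_file):
        session = requests.Session()          # reuse one connection, as a browser would
        out = csv.writer(sys.stdout)
        out.writerow(["title", "status", "approx_bytes", "elapsed_s"])
        for line in open(title_file, encoding="utf-8"):
            title = line.strip()
            if not title:
                continue
            t0 = time.monotonic()
            resp = session.get(BASE + quote(title), stream=True)
            size = wire_size_estimate(resp)
            out.writerow([title, resp.status_code, size,
                          "%.3f" % (time.monotonic() - t0)])
            time.sleep(1)                      # stay polite; this is not a load test

    if __name__ == "__main__":
        main(sys.argv[1])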
With that data in hand, the next phase would be to develop some sort of algorithm for automatically padding HTTP responses to maximize eavesdropper confusion while minimizing overhead. I don't yet know exactly how this would work; I imagine it would be based on clustering the database into sets of pages with similar length but radically different contents. The output would be some combination of changes to MediaWiki core (for instance, to ensure that the overall length of the HTTP response headers does not change when one logs in) and an extension module that performs the bulk of the padding. I am not at all a PHP developer, so I would need help with this part from someone who is.
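As a very rough illustration of the clustering idea, here is a toy greedy scheme: sort pages by size, group them k at a time, and pad every page in a group up to the group's largest member, so each padded size is shared by at least k pages. A real algorithm would also have to handle sequences of requests, media subresources, and pages whose size changes on every edit; this only shows how the padding-overhead-versus-anonymity-set trade-off could be quantified.

    # Toy "cluster then pad" sketch. Assumption: we only care about total
    # response size, and want each page's padded size shared by >= k pages.
    from typing import Dict, Tuple

    def k_anonymous_padding(sizes: Dict[str, int], k: int) -> Tuple[Dict[str, int], float]:
        """Return padded size per page and the overall byte-overhead ratio."""
        ordered = sorted(sizes.items(), key=lambda kv: kv[1])
        padded: Dict[str, int] = {}
        i = 0
        while i < len(ordered):
            # If fewer than k items would remain after this group, absorb them now.
            end = i + k if len(ordered) - (i + k) >= k else len(ordered)
            group = ordered[i:end]
            target = group[-1][1]          # list is sorted, so last element is the max
            for title, _ in group:
                padded[title] = target
            i = end
        overhead = sum(padded.values()) / sum(sizes.values()) - 1.0
        return padded, overhead

    if __name__ == "__main__":
        # Invented sizes, just to exercise the function.
        demo = {"A": 18000, "B": 18500, "C": 21000,
                "D": 90000, "E": 91000, "F": 95000}
        padded, overhead = k_anonymous_padding(demo, k=3)
        print(padded)                      # each padded size shared by >= 3 pages
        print("overhead: %.1f%%" % (overhead * 100))

Running the same calculation over the real size distribution from the crawl, for various k, is how I would expect to estimate whether the bandwidth cost of padding is tolerable.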
What do you think? I know some of this is vague and handwavey but I hope it is at least a place to start a discussion.
zw