On Thu, Jun 5, 2014 at 9:45 AM, Zack Weinberg <zackw(a)cmu.edu> wrote:
I'd like to restart the conversation about
hardening Wikipedia (or
possibly Wikimedia in general) against traffic analysis. I brought
this up ... last November, I think, give or take a month? but it got
lost in a larger discussion about HTTPS.
Thanks Zack, I think this is research that needs to happen, but the WMF
doesn't have the resources to do itself right now. I'm very interested in
seeing the results you come up with.
For background, the type of attack that it would be nice to be able to
prevent is described in this paper:
http://sysseclab.informatics.indiana.edu/projects/sidebuster/sidebuster-fin…
Someone is eavesdropping on an encrypted connection to
LANG.wikipedia.org. (It's not possible to prevent the attacker from
learning the DNS name and therefore the language the target reads,
short of Tor or similar. It's also not possible to prevent them from
noticing accesses to ancillary servers, e.g. Commons for media.) The
attacker's goal is to figure out things like
* what page is the target reading?
* what _sequence of pages_ is the target reading? (This is actually
easier, assuming the attacker knows the internal link graph.)
* is the target a logged-in user, and if so, which user?
* did the target just edit a page, and if so, which page?
* (... y'all are probably better at thinking up these hypotheticals than
me ...)
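To make the threat concrete, here is a minimal sketch (my own illustration, not from the paper) of the simplest size-based version of this attack: an eavesdropper who has pre-crawled the wiki matches the observed length of an encrypted response against a table of known page sizes. TLS hides content but leaks (approximate) length.

```python
# Hypothetical sketch of a size-fingerprinting attack. The attacker has
# pre-crawled the wiki and recorded each page's response size in bytes.

def build_fingerprint_table(pages):
    """pages: dict of title -> response size in bytes (from a prior crawl)."""
    table = {}
    for title, size in pages.items():
        table.setdefault(size, []).append(title)
    return table

def candidates(table, observed_size, slack=32):
    """Titles whose known size is within `slack` bytes of what was seen on
    the wire (slack absorbs header jitter and TLS record overhead)."""
    hits = []
    for size, titles in table.items():
        if abs(size - observed_size) <= slack:
            hits.extend(titles)
    return hits

# Illustrative sizes only.
table = build_fingerprint_table(
    {"Alan_Turing": 151_203, "Enigma": 98_410, "Bletchley_Park": 98_455})
print(candidates(table, 98_430))   # both ~98 kB pages remain plausible
print(candidates(table, 151_203))  # a unique size identifies the page
```

Note that pages with near-unique sizes are identified outright; the padding discussed below aims to make the candidate set large for every page.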
Anything that appears in the public logs is exposed -- account creation is
probably an easy target.
Wikipedia is different from a tax-preparation website (the case study
in the above paper) in that all of the content is public, and edit
actions are also public. The attacker can therefore correlate their
eavesdropping data with observations of Special:RecentChanges and the
like. This may mean it is impossible to prevent the attacker from
detecting edits. I think it's worth the experiment, though.
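The correlation described above can be sketched very simply (names and the five-second window are my own assumptions): match the timestamps of observed encrypted requests that are shaped like edit POSTs against the public RecentChanges feed.

```python
def correlate_edits(observed, recent_changes, window=5.0):
    """observed: timestamps (in seconds) at which the eavesdropper saw an
    upstream burst shaped like a POST to an edit endpoint.
    recent_changes: public entries as (timestamp, title) pairs, e.g.
    scraped from Special:RecentChanges or its feed.
    Returns plausible (observed_time, title) attributions."""
    return [
        (t, title)
        for t in observed
        for ts, title in recent_changes
        if abs(ts - t) <= window
    ]

rc = [(100.0, "Alan_Turing"), (250.0, "Enigma")]
print(correlate_edits([101.5, 400.0], rc))  # [(101.5, 'Alan_Turing')]
```

Because the RecentChanges side is public and timestamped, padding alone cannot hide *that* an edit happened at a given moment, only (perhaps) *which* page it touched.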
What I would like to do, in the short term, is perform a large-scale
crawl of one or more of the encyclopedias and measure what the above
eavesdropper would observe. I would do this over regular HTTPS, from
a documented IP address, both as a logged-in user and an anonymous
user. This would capture only the reading experience; I would also
like to work with prolific editors to take measurements of the traffic
patterns generated by that activity. (Bot edits go via the API, as I
understand it, and so are not reflective of "naturalistic" editing by
human users.)
Make sure to respect typical bot rate limits. Anonymous crawling should be
fine, although logged-in crawling could cause issues. But if you're doing
this from a single machine, I don't think there's too much harm you can do.
Thanks for warning us in advance!
Also, mobile looks very different from desktop. May be worth analyzing it
as well.
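A minimal sketch of the measurement harness (all names hypothetical): record only what the eavesdropper would see for each page view -- total response size and timing -- with a configurable delay to stay well under bot rate limits. The fetch function is stubbed here; a real crawl would issue HTTPS GETs against `LANG.wikipedia.org` from the documented IP.

```python
import time

def measure(fetch, titles, delay=1.0):
    """Record the eavesdropper-visible trace of a crawl: response size in
    bytes and elapsed seconds per page. `fetch(title)` returns raw bytes;
    `delay` throttles the crawl to respect rate limits."""
    observations = []
    for title in titles:
        start = time.monotonic()
        body = fetch(title)
        observations.append({
            "title": title,
            "bytes": len(body),
            "seconds": time.monotonic() - start,
        })
        time.sleep(delay)
    return observations

# Stubbed fetch for illustration.
fake_pages = {"Foo": b"x" * 1000, "Bar": b"y" * 2500}
obs = measure(fake_pages.__getitem__, ["Foo", "Bar"], delay=0)
print([(o["title"], o["bytes"]) for o in obs])
```

The same harness would be run logged-in, anonymous, and against the mobile site, since those three traces may differ substantially.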
With that data in hand, the next phase would be to develop some sort
of algorithm for automatically padding HTTP responses to maximize
eavesdropper confusion while minimizing overhead. I don't yet know
exactly how this would work. I imagine that it would be based on
clustering the database into sets of pages with similar length but
radically different contents. The output of this would be some
combination of changes to MediaWiki core (for instance, to ensure that
the overall length of the HTTP response headers does not change when
one logs in) and an extension module that actually performs the bulk
of the padding. I am not at all a PHP developer, so I would need help
from someone who is with this part.
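One plausible padding scheme (an assumption of mine, not a settled design): round every response length up to the nearest bucket boundary, so that all pages falling in the same bucket present an identical length on the wire. Bucket width trades overhead against anonymity-set size.

```python
def padded_size(n, bucket=4096):
    """Round a response length up to the next multiple of `bucket` bytes,
    so every page in the same bucket is indistinguishable by length."""
    return ((n + bucket - 1) // bucket) * bucket

def pad(body: bytes, bucket=4096, filler=b" "):
    """Pad `body` to the bucket boundary. A real implementation would hide
    the filler, e.g. inside a trailing HTML comment."""
    return body + filler * (padded_size(len(body), bucket) - len(body))

print(len(pad(b"<html>...</html>")))  # 4096
print(len(pad(b"a" * 5000)))          # 8192
```

The clustering idea above is a refinement of this: instead of fixed buckets, choose boundaries so each bucket contains many pages with radically different contents, maximizing the candidate set per observed size.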
Padding the page in OutputPage would be a pretty simple extension,
although ensuring the page is a specific size *after* the web server gzips
it would be more difficult to do efficiently. However, IIRC the most
obvious fingerprinting technique was just looking at the number and sizes
of images loaded from Commons. Making sure those are consistent sizes is
likely going to be hard.
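One way to attack the gzip problem (my own sketch, and it illustrates the efficiency concern rather than solving it): pad the uncompressed HTML with an incompressible comment and recompress until the compressed size reaches the target. Random hex compresses to roughly half its character count, so the loop converges in a few iterations -- but each iteration recompresses the whole page.

```python
import os
import zlib

def pad_to_compressed_size(html: bytes, target: int, level=6):
    """Grow `html` with an incompressible HTML comment until its
    deflate-compressed size reaches `target` bytes. Slight overshoot is
    possible; the repeated recompression is the efficiency problem."""
    filler_len = 0
    while True:
        if filler_len:
            body = html + b"<!-- " + os.urandom(filler_len).hex().encode() + b" -->"
        else:
            body = html
        compressed = len(zlib.compress(body, level))
        if compressed >= target:
            return body, compressed
        filler_len += max(1, target - compressed)

demo, size = pad_to_compressed_size(b"<html><body>hello</body></html>", 300)
print(size)  # >= 300, with a small overshoot possible
```

This says nothing about the Commons image problem, which is harder: image sizes would have to be normalized (or dummy requests injected) at the thumbnail server rather than in MediaWiki output.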
What do you think? I know some of this is vague and handwavey but I
hope it is at least a place to start a discussion.
One more thing to take into account is that the WMF is likely going to
switch to SPDY, which will completely change the characteristics of the
traffic. So developing a solid process that you can repeat next year would
be time well spent.
zw
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l