Hi Zack,
On Thu, Jun 05, 2014 at 12:45:11PM -0400, Zack Weinberg wrote:
> I'd like to restart the conversation about hardening Wikipedia (or possibly Wikimedia in general) against traffic analysis. I brought this up ... last November, I think, give or take a month? but it got lost in a larger discussion about HTTPS.
This sounds like a great idea to me; thanks for thinking about it and sharing it. Privacy of people's reading habits is critical, and the more we can do to ensure it, the better.
> With that data in hand, the next phase would be to develop some sort of algorithm for automatically padding HTTP responses to maximize eavesdropper confusion while minimizing overhead. I don't yet know exactly how this would work. I imagine that it would be based on clustering the database into sets of pages with similar length but radically different contents. The output of this would be some combination of changes to MediaWiki core (for instance, to ensure that the overall length of the HTTP response headers does not change when one logs in) and an extension module that actually performs the bulk of the padding. I am not at all a PHP developer, so I would need help from someone who is with this part.
I'm not a big PHP developer, but given the right project I can be enticed into doing some, and I'd be very happy to help out with this. Ensuring that any changes don't add undue complexity would be very important, but that should be doable.
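To make the padding idea concrete, here's a rough sketch (in Python rather than PHP, purely for illustration; all names are made up, nothing here is MediaWiki code). One simple scheme is to round every response length up to the next boundary in a geometric sequence of "buckets", so that many distinct pages share the same observable size and the per-response overhead is bounded by the growth factor. An actual design would presumably derive bucket boundaries from the clustering you describe rather than a fixed sequence:

```python
# Hypothetical length-bucket padding sketch. Each response is padded up to
# the next bucket boundary; boundaries grow geometrically, so the padding
# overhead per response is bounded by roughly (factor - 1).

def bucket_boundaries(max_size, factor=1.1, start=1024):
    """Return geometric bucket boundaries covering sizes up to max_size."""
    bounds = [start]
    while bounds[-1] < max_size:
        # +1 guarantees strict growth even when factor rounds down
        bounds.append(int(bounds[-1] * factor) + 1)
    return bounds

def padded_size(size, bounds):
    """Smallest bucket boundary >= size: the target length after padding."""
    for b in bounds:
        if b >= size:
            return b
    return bounds[-1]

if __name__ == "__main__":
    sizes = [1500, 4200, 4300, 90000]  # made-up page sizes in bytes
    bounds = bucket_boundaries(max(sizes))
    for s in sizes:
        target = padded_size(s, bounds)
        print(s, "->", target, "(+%d bytes)" % (target - s))
```

The trade-off lives in the growth factor: a smaller factor means less padding overhead but fewer pages per bucket (less confusion for the eavesdropper), which is exactly the overhead-vs-confusion tension you mention.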
As was mentioned, external resources like variously sized images would probably be the trickiest thing to find good ways around. IIRC SPDY can bundle multiple resources into the same stream (server push and multiplexing), which we might be able to take advantage of here (it's been ages since I read about it, though).
Nick