Ryan Lane wrote:
> ... Assuming traffic analysis can be used to determine your browsing habits as they are occurring (which is likely not terribly hard for Wikipedia)
The Google Maps example you linked to works by building a huge database of the exact byte sizes of satellite image tiles. Are you suggesting that we could fingerprint articles by their sizes and/or the sizes of the images they load?
But if so, in your tweet you said padding wouldn't help. But padding would completely obliterate that size information, wouldn't it?
On Thursday, August 1, 2013, James Salsman wrote:
> Ryan Lane wrote:
>> ... Assuming traffic analysis can be used to determine your browsing habits as they are occurring (which is likely not terribly hard for Wikipedia)
> The Google Maps example you linked to works by building a huge database of the exact byte sizes of satellite image tiles. Are you suggesting that we could fingerprint articles by their sizes and/or the sizes of the images they load?
Of course. They can easily crawl us, and we provide everything for download. Unlike sites like Facebook or Google, our content is delivered exactly the same to nearly every user.
> But if so, in your tweet you said padding wouldn't help. But padding would completely obliterate that size information, wouldn't it?
Only Opera has pipelining enabled, so resource requests are serial. Also, our resources are delivered from a number of URLs (upload, bits, text), making it easier to identify resources. Even with padding, you can take the relative sizes of the resources being delivered, and the order of those sizes, and get a pretty good idea of the article being viewed. If there's enough data, you may be able to identify multiple articles and see whether the subsequent article is a link from the previous article, making guesses more accurate. It only takes a single accurate guess for an edit to identify an editor and see their entire edit history.
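To make that concrete, here's a minimal sketch of that kind of size-sequence matching. Everything below is invented for illustration (the article names, sizes, and the 1 KiB bucketing), not a description of a real tool or of our actual traffic:

from typing import Dict, List, Tuple

def to_buckets(sizes: List[int], bucket: int = 1024) -> Tuple[int, ...]:
    """Reduce exact byte sizes to padded 1 KiB buckets, preserving order."""
    return tuple((s + bucket - 1) // bucket for s in sizes)

def best_match(observed: List[int],
               fingerprints: Dict[str, List[int]],
               bucket: int = 1024) -> str:
    """Pick the crawled article whose bucketed size sequence best matches
    the observed (encrypted but length-visible) transfer sequence."""
    obs = to_buckets(observed, bucket)
    def score(candidate: List[int]) -> int:
        cand = to_buckets(candidate, bucket)
        # count positions where the bucketed sizes agree, in order
        return sum(1 for a, b in zip(obs, cand) if a == b)
    return max(fingerprints, key=lambda title: score(fingerprints[title]))

# Hypothetical crawled fingerprints: article -> ordered resource sizes.
fingerprints = {
    "Example_A": [48_213, 120_554, 3_901, 77_002],
    "Example_B": [48_900, 12_115, 3_850, 9_430],
}
observed = [48_500, 121_000, 4_000, 77_500]   # transfer sizes observed on the wire
print(best_match(observed, fingerprints))      # -> Example_A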
Proper support of pipelining in browsers, or multiplexing in protocols like SPDY, would help this situation. There are probably a number of things we can do to improve the situation without pipelining or newer protocols, and we'll likely put some effort in on this front. I think this takes priority over PFS, as PFS isn't helpful if decryption isn't necessary to track browsing habits.
Of course, the highest priority is simply to enable HTTPS by default, as it forces the use of traffic analysis or decryption, which is likely a high enough bar to hinder tracking efforts for a while.
- Ryan
On Aug 1, 2013, at 10:07 PM, Ryan Lane <rlane@wikimedia.org> wrote:
> Also, our resources are delivered from a number of URLs (upload, bits, text), making it easier to identify resources. Even with padding, you can take the relative sizes of the resources being delivered, and the order of those sizes, and get a pretty good idea of the article being viewed. If there's enough data, you may be able to identify multiple articles and see whether the subsequent article is a link from the previous article, making guesses more accurate. It only takes a single accurate guess for an edit to identify an editor and see their entire edit history.
> Proper support of pipelining in browsers, or multiplexing in protocols like SPDY, would help this situation. There are probably a number of things we can do to improve the situation without pipelining or newer protocols, and we'll likely put some effort in on this front. I think this takes priority over PFS, as PFS isn't helpful if decryption isn't necessary to track browsing habits.
This needs some proper crypto expert vetting, but...
It would be trivial (both in effort and in impact on customer bandwidth) to pad everything to a 1k boundary on HTTPS transmissions once we get there. A variable-length, non-significant header field can be used. Forcing the observed sizes into such large bins would degrade fingerprinting significantly.
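A minimal sketch of what that could look like, assuming the padding rides in a hypothetical non-significant response header (the header name and the 312-byte figure for the other headers are made up):

BOUNDARY = 1024  # pad every response up to the next 1 KiB boundary

def pad_header(body_len: int, other_headers_len: int) -> bytes:
    """Build a variable-length, non-significant header so the response
    lands exactly on the next 1 KiB boundary."""
    total = body_len + other_headers_len
    pad = (-total) % BOUNDARY          # bytes still needed to reach the boundary
    framing = len(b"X-Padding: \r\n")  # 13 bytes of fixed header framing
    if pad < framing:
        pad += BOUNDARY                # not enough room; jump one more bin
    return b"X-Padding: " + b"." * (pad - framing) + b"\r\n"

# Example: a 592-byte body plus 312 bytes of real headers gets padded to 1024.
body = b"<html>...</html>" * 37
hdr = pad_header(len(body), other_headers_len=312)
assert (len(body) + 312 + len(hdr)) % BOUNDARY == 0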
It would also not take much more effort or customer impact to pad to the next larger 1k size for a random large fraction of transmissions. One could imagine a user setting to opt in or out of that, for example, and perhaps a set of relative inflation schemes to choose from (10% of responses inflated, 25% inflated, 50%, 50% plus an extra 1-5 k of padding on 10%, ...).
Even the slightest of these options (under HTTPS everywhere) starts to give plausible deniability to someone's browsing; the stronger ones would make fingerprinting quite painful, though running a statistical exercise over such options, to see how much harder each one makes the attack, seems useful for understanding the effects...
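A rough, synthetic version of both ideas: randomly inflate some responses by an extra 1 KiB bin, then measure how many articles in a pretend crawl still have a unique padded size sequence. Every number here is invented; a real exercise would use actual crawl data and a real matching model:

import random
from collections import Counter

random.seed(0)

# Pretend crawl: article -> ordered resource sizes in bytes (all invented).
articles = {f"Article_{i}": [random.randint(2_000, 200_000)
                             for _ in range(random.randint(3, 8))]
            for i in range(5_000)}

def padded_sequence(sizes, extra_bin_prob, boundary=1024):
    """Bucket each size to a 1 KiB bin, sometimes adding one extra bin."""
    seq = []
    for s in sizes:
        bins = -(-s // boundary)                 # ceiling division
        if random.random() < extra_bin_prob:
            bins += 1                            # random inflation
        seq.append(bins)
    return tuple(seq)

for scheme, p in [("no inflation", 0.0), ("10% inflated", 0.10),
                  ("50% inflated", 0.50)]:
    counts = Counter(padded_sequence(s, p) for s in articles.values())
    unique = sum(1 for n in counts.values() if n == 1)
    print(f"{scheme}: {unique / len(articles):.1%} of articles still have "
          f"a unique padded size sequence")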
The question is, what is the point of this? To provide very strong user obfuscation? To provide at least enough individual evidentiary obfuscation, relative to what a US court (for example) might consider scientifically reliable, to block use of that browsing history in trials (even if law enforcement might still make educated guesses as to the articles)?
Countermeasures are responses to attain specific goals. What goals do people care about for such a program, and what is the Foundation willing to consider worth supporting with bandwidth $$ or programmer time? How do we come up with a list of possible goals and prioritize among them, in both a technical and a policy/goals sense?
I believe that PFS will come out higher here, as its cost is really only CPU crunchies and choosing among already-existing software settings, and its benefits to long-term total obscurability are significant if done right.
No quantity of countermeasures beats inside info, and out-of-band compromise of our main keys ends up being attractive as the only logical attack once we start down this road at all past HTTPS-everywhere. One-time key compromise is far more likely than real-time compromise of PFS keys as they rotate, though even that is possible given sufficiently motivated, successful, stealthy subversion. The credible ability, in the end, to be confident that isn't happening is arguably the long-term ceiling for how high we can realistically go with countermeasures, and its primary limits are operational security and intrusion detection rather than in-band behavior.
At some point the ops team would need a security team, an IDS team, and a counterintelligence team to watch the other teams, and I don't know if the Foundation cares that much or would find operating that way to be a more comfortable moral and practical stance...
George William Herbert
Sent from my iPhone