We know the NSA wants Wikipedia data, as Wikipedia is listed in one of the leaked NSA slides:
https://commons.wikimedia.org/wiki/File:KS8-001.jpg
That slide is about HTTP, and the tech staff are moving the user/reader base to HTTPS.
As we learn more about the NSA programs, we need to consider vectors other than HTTP by which the NSA could obtain the data they want, and the userbase needs to be aware of the current risks.
One question from the "Dells are backdored"[sic] thread that is worth separate consideration is:
Are the Wikimedia transit links encrypted, especially for database replication? MySQL supports replication over SSL, so I assume the answer is yes.
If not, would it be necessary, useful, and feasible?
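For anyone unfamiliar with the mechanism, here is a minimal sketch of how SSL replication is typically switched on for a MySQL slave. The hostname, user, and certificate paths are hypothetical examples, not the actual WMF configuration:

```sql
-- On the slave: point replication at the master over SSL.
-- Hostname, account, and certificate paths below are made up for illustration.
CHANGE MASTER TO
  MASTER_HOST = 'db-master.example.org',
  MASTER_USER = 'repl',
  MASTER_SSL  = 1,
  MASTER_SSL_CA   = '/etc/mysql/ssl/ca.pem',
  MASTER_SSL_CERT = '/etc/mysql/ssl/client-cert.pem',
  MASTER_SSL_KEY  = '/etc/mysql/ssl/client-key.pem';
START SLAVE;
```

The replication account can also be created with REQUIRE SSL on the master, so the master refuses any unencrypted replication connection rather than relying on the slave to ask for SSL.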
However, we also need to consider that SSL and other encryption may be useless against the NSA et al., which means replicating non-public data should be avoided wherever possible, as each copy becomes another point of exposure.
Given how public our system is, we don't have a lot of non-public data, so we might be able to design the architecture so that this information isn't replicated, and also ensure it isn't accessed over insecure links. I think the only parts of the dataset that are private and valuable are:
* passwords/login cookies,
* checkuser info - IPs and useragents,
* WMF analytics, which includes readers iirc,
* hidden/deleted edits, and
* private wikis and mailing lists.
Have I missed any?
Are passwords and/or checkuser info replicated?
Is there a data policy on WMF analytics data which prevents it from flowing over insecure links, limits what is collected, and ensures destruction of the data within a reasonable timeframe? For example, how about not using cookies to track analytics for readers who are on HTTP instead of HTTPS?
The private wikis can be restricted to HTTPS, depending on the value of their data in the wrong hands. The private mailing lists will be harder to secure, and the English Wikipedia arbcom list, at least, contains a lot of valuable data about contributors.
Regarding hidden/deleted edits, replication isn't the only source of this data. All edits are also exposed via Recent Changes (https/api/etc.) as they occur, and the value of these edits comes precisely from the fact that they are hidden afterwards (e.g. they don't appear in the dumps). Is there any way to control who is effectively capturing all edits via Recent Changes?
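To make the exposure concrete: capturing every edit as it happens only takes polling the standard api.php recentchanges list. This is a hedged sketch, assuming the usual en.wikipedia.org endpoint; the sample response below is fabricated for illustration, shaped like a real reply:

```python
# Sketch of how anyone can capture all edits via the MediaWiki
# Recent Changes API before some of them are later hidden or deleted.
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def recentchanges_url(limit=500):
    """Build a query URL for the most recent edits."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp|user",
        "rclimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def extract_revids(response_text):
    """Pull the revision ids out of an api.php recentchanges response."""
    data = json.loads(response_text)
    return [rc["revid"] for rc in data["query"]["recentchanges"]]

# Fabricated sample response for illustration:
sample = json.dumps({
    "query": {"recentchanges": [
        {"type": "edit", "title": "Example", "revid": 1001,
         "timestamp": "2013-12-30T00:00:00Z", "user": "Alice"},
        {"type": "edit", "title": "Example", "revid": 1002,
         "timestamp": "2013-12-30T00:01:00Z", "user": "Bob"},
    ]}})

print(recentchanges_url(2))
print(extract_revids(sample))  # -> [1001, 1002]
```

Once a revid has been harvested this way, later revision-deletion or oversight on the wiki does nothing to claw the content back from whoever polled it, which is why controlling (or at least understanding) who consumes the full RC feed matters.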
-- John Vandenberg