[Wikimedia-l] Data privacy, encrypted links and recent change captures

John Vandenberg jayvdb at gmail.com
Mon Dec 30 07:10:50 UTC 2013

We know NSA wants Wikipedia data, as Wikipedia is listed in one of the
NSA slides:


That slide is about HTTP, and the tech staff are moving the
user/reader base to HTTPS.

As we learn more about the NSA programs, we need to consider vectors
other than HTTP for the NSA to obtain the data they want.  And the
userbase needs to be aware of the current risks.

One question from the "Dells are backdored"[sic] thread that is worth
separate consideration is:

Are the Wikimedia transit links encrypted, especially for database replication?
MySQL has replication over SSL, so I assume the answer is Yes.

If not, is this necessary or useful, and feasible?
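
For anyone who wants to check a replica rather than assume, here is a
minimal sketch.  It assumes the output of SHOW SLAVE STATUS has already
been fetched into a dict (one row, column name to value); the function
name is hypothetical, and nothing here reflects how the WMF databases
are actually configured.

```python
def replication_permits_ssl(slave_status):
    """Return True if a replica's status row says SSL to the master is permitted.

    `slave_status` is assumed to be one row of MySQL's SHOW SLAVE STATUS
    as a dict.  The Master_SSL_Allowed column is "Yes" when an SSL
    connection is permitted, "No" when it is not, and "Ignored" when SSL
    was requested but the replica server has no SSL support enabled.
    """
    return slave_status.get("Master_SSL_Allowed") == "Yes"


if __name__ == "__main__":
    # Example rows, made up for illustration only.
    print(replication_permits_ssl({"Master_SSL_Allowed": "Yes"}))
    print(replication_permits_ssl({"Master_SSL_Allowed": "Ignored"}))
```

A "Yes" here only says SSL is permitted for the replication channel; it
says nothing about whether the underlying transit link is otherwise
protected.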

However, we also need to consider that SSL and other encryption may be
useless against NSA/etc, which means replicating non-public data
should be avoided wherever possible, as it becomes a single point of
failure.

Given how public our system is, we don't have a lot of non-public
data, so we might be able to design the architecture so that
information isn't replicated, and also ensure it isn't accessed over
insecure links.  I think the only parts of the dataset that are
private & valuable are
* passwords/login cookies,
* checkuser info - IPs and useragents,
* WMF analytics, which includes readers iirc,
* hidden/deleted edits, and
* private wikis and mailing lists.

Have I missed any?

Are passwords and/or checkuser info replicated?

Is there a data policy on WMF analytics data which prevents it flowing
over insecure links, and limits what is collected and ensures
destruction of the data within reasonable timeframes?  i.e. how about
not using cookies to track analytics of readers who are on HTTP
instead of HTTPS?
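
To make the question concrete, the kind of rule I have in mind could be
sketched like this.  The function names and the 90-day window are
hypothetical, chosen only for illustration; they are not an existing
WMF policy or implementation.

```python
from datetime import datetime, timedelta

# Hypothetical retention window; an assumption, not a WMF policy.
RETENTION = timedelta(days=90)


def may_set_analytics_cookie(scheme):
    """Only allow an analytics cookie when the reader is on HTTPS."""
    return scheme.lower() == "https"


def is_due_for_destruction(collected_at, now):
    """Return True once a record is older than the retention window."""
    return now - collected_at > RETENTION


if __name__ == "__main__":
    print(may_set_analytics_cookie("http"))
    print(is_due_for_destruction(datetime(2013, 1, 1), datetime(2013, 12, 30)))
```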

The private wikis can be restricted to https, depending on the value
of the data on those wikis in the wrong hands.  The private mailing
lists will be harder to secure, and at least the English Wikipedia
arbcom list contains a lot of valuable data about contributors.

Regarding hidden/deleted edits, the replication isn't the only source
of this data.  All edits are also exposed via Recent Changes
(https/api/etc) as they occur, and the value of these edits is
determined by the fact they are hidden afterwards (e.g. don't appear
in dumps).  Is there any way to control who is effectively capturing
all edits via Recent Changes?
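
To illustrate how little is needed to capture edits as they occur, here
is a rough sketch against the public MediaWiki API
(action=query&list=recentchanges).  The endpoint shown is just an
example wiki, the function names are hypothetical, and the sketch only
builds the request URL and parses a response; it does not fetch or
store anything.

```python
import json
from urllib.parse import urlencode

# Example endpoint; an assumption for illustration only.
API_URL = "https://en.wikipedia.org/w/api.php"


def recent_changes_url(limit=50):
    """Build a Recent Changes query URL for the MediaWiki API."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "ids|title|user|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_revisions(response_text):
    """Pull (revid, title, user) tuples out of a JSON API response."""
    data = json.loads(response_text)
    changes = data.get("query", {}).get("recentchanges", [])
    return [(c.get("revid"), c.get("title"), c.get("user")) for c in changes]


if __name__ == "__main__":
    print(recent_changes_url(10))
```

Anyone polling a URL like this and keeping the revision ids would retain
a record of edits that are later hidden, which is why I am asking
whether such capture can be controlled at all.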

John Vandenberg
