Creating and promoting the use of a closed-source tool, especially one used to detect disruptive editing, runs counter to core Wikimedia community principles.
Making such a tool closed-source prevents the Wikimedia editing community from auditing its use, contesting its decisions, making improvements to it, or learning from its creation. This causes harm to the community.
Open-sourcing a tool such as this could allow an unscrupulous user to connect accounts that are not publicly connected. This is a problem with all sock detection tools. It also causes harm to the community.
The only way to create such a tool that does not harm the community in any way is to make the tool's decision-making entirely public while keeping the tool's decisions non-public. This is not possible. However, we can approach that goal using careful engineering and attempt to minimize harm. Measures that can reduce harm include restricting the interface to CUs, requiring a logged reason for a check, adding technical barriers against fishing (comparing two known users, not looking for other potential users), not making processed data available publicly, and publishing the entire source code (including code used to load data).
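As a rough illustration of the first three safeguards (CU-only access, a mandatory logged reason, and pairwise-only comparison), here is a minimal Python sketch. Every name in it (CheckLog, PairwiseChecker, the scorer callable) is hypothetical and invented for this example; it is not taken from any existing tool, and a real deployment would need proper authentication and tamper-resistant logging.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class CheckLog:
    """Hypothetical append-only record of every check that was run."""
    entries: list = field(default_factory=list)

    def record(self, checker: str, user_a: str, user_b: str, reason: str) -> None:
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "checker": checker,
            "users": (user_a, user_b),
            "reason": reason,
        })


class PairwiseChecker:
    """Only compares two named accounts; there is no 'find similar users' call."""

    def __init__(self, checkusers: set, scorer: Callable[[str, str], float], log: CheckLog):
        self.checkusers = checkusers  # assumed list of accounts allowed to run checks
        self.scorer = scorer          # assumed similarity function, kept abstract here
        self.log = log

    def compare(self, checker: str, user_a: str, user_b: str, reason: str) -> float:
        if checker not in self.checkusers:
            raise PermissionError("only CUs may run a check")
        if not reason.strip():
            raise ValueError("a reason must be given and is logged")
        if user_a == user_b:
            raise ValueError("two distinct known accounts are required")
        # Log before scoring so that no result exists without a log entry.
        self.log.record(checker, user_a, user_b, reason)
        return self.scorer(user_a, user_b)
```

The anti-fishing barrier here is structural: the only public method takes exactly two account names, so a caller cannot ask the tool to search for unknown matches.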
After all that, if you are not satisfied that harm has been sufficiently reduced, there is only one answer: do not create the tool.
AntiCompositeNumber
On Wed, Aug 5, 2020 at 10:33 PM Amir Sarabadani <ladsgroup@gmail.com> wrote:
Hey, I have an ethical question that I haven't been able to answer. I have been asking around but have no definite answer yet, so I'm asking a larger audience in the hope of a solution.
For almost a year now, I have been developing an NLP-based AI system to catch sock puppets (two users pretending to be different but actually the same person). It's based on the way they speak. The way we speak is like a fingerprint: it's unique to us and it's really hard to forge or change on demand (unlike IP/UA). As a result, if you apply some basic AI techniques to Wikipedia discussions (which can be really lengthy, trust me), the sock puppets stand out in the data.
Here's an example; I highly recommend looking at these graphs. I compared two pairs of users: one pair that are not sock puppets, and another pair of known socks (a user who got banned indefinitely but came back hidden under another username). [1][2] These graphs are based on one of several aspects of this AI system.
I have talked about this with WMF and other CUs, with the aim of building it to help us understand and catch socks, especially the ones that have enough resources to change their IP/UA regularly (like sock farms and/or UPEs). Also, with the increase in mobile internet providers and the horrible way they assign IPs to their users, this can get really handy in some SPI ("Sock puppet investigation") [3] cases.
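To make the idea of comparing word distributions concrete, here is a toy Python sketch. It is not the system described in this message, whose code is not public; it assumes a simple relative word-frequency profile per user and uses cosine similarity as a stand-in comparison, and the function names are invented for the example.

```python
import math
import re
from collections import Counter


def word_distribution(text: str) -> dict:
    """Relative frequency of each word in a user's combined comments."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(p: dict, q: dict) -> float:
    """1.0 for identical distributions, 0.0 when no words are shared."""
    dot = sum(freq * q.get(word, 0.0) for word, freq in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)
```

A single score like this is only one weak signal. With short texts or shared topic vocabulary, unrelated users can look similar, so it should not be read as evidence on its own.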
The problem is that this tool, while being built only on public information, actually has the power to expose legitimate sock puppets: people who live under oppressive governments and edit on sensitive topics. Disclosing such connections between two accounts can cost people their lives.
So, this code is not going to be public, period. But we need to have this code in Wikimedia Cloud Services so that people like CUs in other wikis can use it as a web-based tool instead of me running it for them upon request. But the WMCS terms of use explicitly say code should never be closed-source, and this is our principle. What should we do? Should I pay a corporate cloud provider for this and put such important code and data there? Should we amend the terms of use to have some exceptions like this one?
The most plausible solution suggested so far (thanks Huji) is to have a shell of the code that would be useless without data, keep the code that produces the data (out of dumps) closed (which is fine, running that code is not too hard even on enwiki), and update the data myself. This might be doable (I'm around 30% sure of that; it still might expose too much), but it wouldn't cover future cases similar to mine and I think a more long-term solution is needed here. Also, it would reduce the bus factor to 1, and maintenance would be complicated.
What should we do?
Thanks

[1] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_f...
[2] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_f...
[3] https://en.wikipedia.org/wiki/Wikipedia:SPI

--
Amir (he/him)
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l