Hey, I have an ethical question that I haven't been able to answer yet. I've been asking around without getting a definite answer, so I'm putting it to a larger audience in the hope of finding a solution.
For almost a year now, I have been developing an NLP-based AI system to catch sock puppets (two accounts pretending to be different people but actually operated by the same person). It's based on the way they speak. The way we speak is like a fingerprint: it's unique to us and really hard to forge or change on demand (unlike IP/UA). As a result, if you apply some basic AI techniques to Wikipedia discussions (which can be really lengthy, trust me), sock puppets stand out in the data.
Here's an example; I highly recommend looking at these graphs. I compared two pairs of users: one pair that are not sock puppets, and one pair of known socks (a user who was banned indefinitely but came back hidden under another username). [1][2] These graphs are based on one of several aspects of this AI system.
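For readers who want a feel for what comparing two users' word distributions can look like in its very simplest form, here is an illustrative sketch. It is not the tool described here; the tokenization and the divergence measure are placeholder assumptions.

    # Illustrative sketch only, not the tool described above: compare two
    # users' word distributions with Jensen-Shannon divergence (log base 2,
    # so 0 means identical word use and 1 means completely disjoint).
    import re
    from collections import Counter
    from math import log2

    def word_distribution(comments):
        """Relative word frequencies over a user's (already cleaned) comments."""
        words = re.findall(r"[^\W\d_]+", " ".join(comments).lower())
        counts = Counter(words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def jensen_shannon(p, q):
        m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
        def kl(a, b):
            return sum(a[w] * log2(a[w] / b[w]) for w in a if a[w] > 0)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # Hypothetical, already-cleaned discussion comments for two accounts:
    user_a = ["Honestly, I kinda think this source is fine, to be fair."]
    user_b = ["Honestly, this source is kinda fine, to be fair, I think."]
    print(jensen_shannon(word_distribution(user_a), word_distribution(user_b)))

In practice, as noted elsewhere in the thread, most of the real work is in extracting clean text from wikitext, templates and signatures, and in using richer features than raw word counts.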
I have talked with the WMF and other CUs about building this to help us understand and catch socks, especially the ones that have enough resources to change their IP/UA regularly (like sock farms and/or UPEs). Also, with the rise of mobile internet providers and the horrible way they assign IPs to their users, this can come in really handy in some SPI ("sock puppet investigation") [3] cases.
The problem is that this tool, while built only on public information, actually has the power to expose legitimate sock puppets: people who live under oppressive governments and edit on sensitive topics. Disclosing a connection between two such accounts can cost people their lives.
So, this code is not going to be public, period. But we need to host it in Wikimedia Cloud Services so that CUs on other wikis can use it as a web-based tool instead of me running it for them on request. The WMCS terms of use, however, explicitly say code should never be closed-source, and this is one of our principles. What should we do? Do I pay a corporate cloud provider and put such sensitive code and data there? Do we amend the terms of use to allow exceptions like this one?
The most plausible solution suggested so far (thanks Huji) is to publish a shell of the code that would be useless without data, keep the code that produces the data (out of dumps) closed (running that code is not too hard, even on enwiki), and update the data myself. This might be doable (I'm only around 30% sure; it might still expose too much), but it wouldn't cover future cases similar to mine, and I think a more long-term solution is needed here. It would also reduce the bus factor to 1, and maintenance would be complicated.
What should we do?
Thanks [1] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_f... [2] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_f... [3] https://en.wikipedia.org/wiki/Wikipedia:SPI
Creating and promoting the use of a closed-source tool, especially one used to detect disruptive editing, runs counter to core Wikimedia community principles.
Making such a tool closed-source prevents the Wikimedia editing community from auditing its use, contesting its decisions, making improvements to it, or learning from its creation. This causes harm to the community.
Open-sourcing a tool such as this could allow an unscrupulous user to connect accounts that are not publicly connected. This is a problem with all sock detection tools. It also causes harm to the community.
The only way to create such a tool that does not harm the community in any way is to make the tool's decision making entirely public while keeping the tool's decisions non-public. This is not possible. However, we can approach that goal using careful engineering and attempt to minimize harm. Things like restricting the interface to CUs, requiring a logged reason for a check, technical barriers against fishing (comparing two known users, not looking for other potential users), not making processed data available publicly, and publishing the entire source code (including code used to load data) can reduce harm.
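To make the "restricted, logged, no fishing" part concrete, a rough sketch of what such an interface could enforce might look like the following. The group name, log format and similarity function are illustrative assumptions, not an existing implementation.

    # Illustrative sketch only: a CU-only, pairwise, always-logged comparison
    # interface. Group name, log format and similarity_fn are assumptions.
    import logging

    logging.basicConfig(level=logging.INFO)
    audit_log = logging.getLogger("sock-compare-audit")

    def compare_accounts(requesting_user, user_a, user_b, reason, similarity_fn):
        """Compare exactly two named accounts; refuse open-ended fishing."""
        if "checkuser" not in requesting_user.get("groups", []):
            raise PermissionError("Only checkusers may run comparisons.")
        if not reason or len(reason.strip()) < 10:
            raise ValueError("Every check needs a meaningful, logged reason.")
        # The audit trail is meant to be visible to all users of the tool.
        audit_log.info("%s compared %r and %r because: %s",
                       requesting_user["name"], user_a, user_b, reason)
        return similarity_fn(user_a, user_b)

None of this prevents someone from re-implementing the model elsewhere, but it makes the sanctioned interface auditable.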
After all that, if you are not satisfied that harm has been sufficiently reduced, there is only one answer: do not create the tool.
AntiCompositeNumber
Technically, you could make the tool open source and still keep the source code secret. That solves the maintenance problem (others who get access can legally modify it). Of course, you'd have to trust everyone with access to the files not to publish them, which they would be legally entitled to do (unless there is some NDA-like mechanism).
Transparency and auditability wouldn't be achieved just by making the code public anyway; they need to be addressed in the tool's design (keeping logs, providing feedback options for users, exposing the components of a decision as much as possible).
I'd agree with Bawolff, though, that there is probably no point in going to great lengths to keep the details secret, as creating a similar tool is probably not that hard. You can at least build assumptions into the tool that are nontrivial to fulfill outside Toolforge (e.g. using the replicas instead of dumps) so that running it elsewhere requires some effort.
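As one example of such a Toolforge-specific assumption: reading the standard replica credentials file and querying the Wiki Replicas instead of dumps, which simply fails off-platform. The host name and file path below follow the usual Toolforge conventions and may differ from the real setup; treat this as a sketch, not documentation.

    # Sketch: tie the tool to Toolforge by reading the standard replica
    # credentials and querying the Wiki Replicas instead of dumps. Outside
    # Toolforge this connection simply fails. Host naming is an assumption
    # and has changed over time.
    import configparser
    import os

    import pymysql

    def replica_connection(wiki="enwiki"):
        cfg = configparser.ConfigParser()
        cfg.read(os.path.expanduser("~/replica.my.cnf"))  # provided on Toolforge
        return pymysql.connect(
            host=f"{wiki}.analytics.db.svc.wikimedia.cloud",  # replica service host
            db=f"{wiki}_p",
            user=cfg["client"]["user"].strip("'"),
            password=cfg["client"]["password"].strip("'"),
            charset="utf8mb4",
        )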
That's a tough question, and I'm not sure what the answer is.
There is a little bit of precedent with https://www.mediawiki.org/w/index.php?oldid=2533048&title=Extension:Anti...
When evaluating harm, I guess one of the questions is how your approach compares in effectiveness to other publicly available approaches like http://www.philocomp.net/humanities/signature.htm and https://github.com/search?q=authorship+attribution+user:pan-webis-de (i.e. there is more harm if your approach is significantly better than already-available tools, and less if they're at a similar level).
-- Brian
I'm afraid I have to agree with what AntiCompositeNumber wrote. When you set up infrastructure to fight abuse – no matter whether that infrastructure is a technical barrier like a captcha, a tool that "blames" people for being sock puppets, or a law – it will affect *all* users, not only the abusers. What you need to think about is not whether what you do is right or wrong, but whether there is still an acceptable balance between the intended positive effects and the unavoidable negative effects.
That said, I'm very happy to see something like this being discussed this early. That doesn't always happen. Does anyone still remember discussing the "Deep User Inspector"[1][2] in 2013?
Having read what has already been said about "harm", I feel something is missing: AI-based tools always have the potential to cause harm simply because people don't really understand what it means to work with such a tool. For example, when the tool says "there is a 95% certainty this is a sock puppet", people will treat this as "proof", totally ignoring the fact that the particular case they are looking at could just as well be within the 5%. This is why I believe such a tool cannot be a toy, open for anyone to play around with, but needs trained users.
TL;DR: Closed source? No. Please avoid at all costs. Closed databases? Sure.
Best Thiemo
[1] https://ricordisamoa.toolforge.org/dui/ [2] https://meta.wikimedia.org/wiki/User_talk:Ricordisamoa#Deep_user_inspector
Nice idea! The first time I wrote about this being possible was back in 2008 or so.
The problem is quite trivial: you use some observable feature to fingerprint an adversary. The adversary can then game the system if the observable feature can somehow be changed or modified. To avoid this, the observable features are usually chosen to be physical properties that can't easily be changed.
In this case the features are words and/or relations between words, and the question then is "Can the adversary change their choice of words?" Yes, they can, because the choice of words is not an inherent physical property of the user. In fact, there are several programs that help users express themselves more fluently, and such programs change the observable features, i.e. the choice of words. The program moves the observable features (the words) from a user-specific distribution to a more program-specific distribution: a priori you will observe the users as different, but with the program they become, a posteriori, more similar.
A real problem is your own poisoning of the training data. That happens when you conclude that some subject is the same as your postulated one and then feed that information back into your training data. If you don't do that, your training data will start to rot, because humans change over time. It is bad either way.
Even more fun is an adversary who knows what you are doing and tries to defeat your detection algorithm, or even fool you into believing he is someone else. It is, after all, nothing more than word counts and statistics. What will you do when someone edits a Wikipedia page and your system tells you "This revision was most likely written by Jimbo"?
Several such programs exist, and I'm a bit perplexed that they are not in wider use among Wikipedia's editors. Some of them are more aggressive and can propose quite radical rewrites of the text. I use one of them; it is not the best, but it still corrects me all the time.
I believe it would be better to create a system where users are internally identified and externally authenticated. (The former is biometric identification, and must adhere to privacy laws.)
For those interested: as far as I know, the best solution for this kind of similarity detection is a Siamese network with RNNs in the first part. That implies you must extract fingerprints for all likely candidates (users), plus some more to create a baseline. You cannot simply claim that two users (the adversary and the postulated sock) are the same because they have edited the same page; it is quite unlikely a user will edit the same page with a sock puppet once it is known that such a system is active.
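For anyone unfamiliar with the term, a minimal sketch of a Siamese set-up with an RNN encoder might look like the following; the vocabulary size, dimensions and fake inputs are placeholders, not a recommendation for the tool being discussed.

    # Minimal sketch of a Siamese text-similarity model: one shared GRU
    # encoder applied to both token sequences, compared with cosine similarity.
    # Vocabulary size, dimensions and the fake input below are placeholders.
    import torch
    import torch.nn as nn

    class SiameseEncoder(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

        def encode(self, token_ids):
            _, h = self.rnn(self.embed(token_ids))  # h: (1, batch, hidden_dim)
            return h.squeeze(0)                     # one "fingerprint" per text

        def forward(self, tokens_a, tokens_b):
            # The same weights are used on both sides -- that is the "Siamese" part.
            return nn.functional.cosine_similarity(self.encode(tokens_a),
                                                   self.encode(tokens_b))

    model = SiameseEncoder(vocab_size=50_000)
    a = torch.randint(1, 50_000, (4, 120))  # 4 texts of 120 token ids, user A
    b = torch.randint(1, 50_000, (4, 120))  # 4 texts of 120 token ids, user B
    print(model(a, b))                      # values near 1.0 = similar style

A real system would train the shared encoder on known sock/non-sock pairs with a contrastive or triplet loss, which is exactly where the baseline and training-data-poisoning issues described above come in.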
I think an important thing to note is that this is public information, so such a model, whether better or worse, can easily be built by any AI enthusiast. The potential for misuse is limited, as it's relatively easy to game, and I don't think the model's results will hold more water than behaviour analysis done by a human (at which some editors excel). Theoretically, feeding such edits into an assessment system similar to ClueBot and having expert sockpuppet hunters assess them would result in a much more accurate and more "dangerous" model, so to say. But since it works on public information, it shouldn't be closed source; that probably only stifles innovation (see e.g. GPT-3's eventual release). If the concern is privacy, it would probably be best to dismantle the entire project, but again, someone who wants to can simply put in the hours required to do something similar, so there's not much point. And by raising this here, you've probably triggered the Streisand effect: more people are now aware of your model and its possible repercussions, although transparency is quite integral in all open-source communities. In the end it comes down to your choice; there's no right answer as far as I can tell.
Best, QEDK
I appreciate that Amir is acknowledging that, as neat as this tool sounds, its use is fraught with risk. The comparison that immediately jumped to my mind is the predictive algorithms used in the criminal justice system to assess the risk of bail jumping or criminal recidivism. These algorithms have been largely secret, their use hidden, their conclusions non-public. The more we learn about them, the clearer it is how deeply flawed they are. Obviously the real-world consequences of those tools are more severe, in that they directly lead to the incarceration of many people, but I think the comparison is illustrative of the risks. It also suggests the kind of ongoing, comprehensive review that should accompany making this tool available to users.
The potential misuse to be concerned about here is by amateurs with antisocial intent, or by intended users who are reckless or ignorant of the risks. Major governments have the resources to easily build this themselves, and if they care enough about fingerprinting Wikipedians, they likely already have.
I think if the tool is useful and there's demand for it, everything about it – how it works, who uses it, what conclusions and actions are taken as a result of its use, etc. – should be made public. That's the only way we'll discover the many ways in which it will surely, eventually, be misused. SPI has been using these 'techniques' in a manual way, or with unsophisticated tools, for many years. But like any tool, the data fed into it can train the system incorrectly. The results it returns can be misunderstood or intentionally misused. Knowledge of its existence will lead the most sophisticated to beat it, or to intentionally misdirect it. People who are innocent of any violation of our norms will be harmed by its use. Please establish the proper cultural and procedural safeguards to limit that harm as much as possible.
Thanks Amir for having this conversation here.
On Nathan's point: outside the Wikimedia projects, we in the free culture movement tend to argue for full transparency about the functioning of "automated decision making", "algorithmic tools", "forensic software" and so on, typically ensured by open data and free software.
Wikimedia wikis are not a court system, a block is not jail, and so on; but, for instance, the EFF has just argued that in the US judiciary certain rights ought to be ensured to respect the Sixth Amendment. https://www.eff.org/deeplinks/2020/07/our-eu-policy-principles-procedural-ju... https://www.eff.org/deeplinks/2020/08/eff-and-aclu-tell-federal-court-forens... https://meta.wikimedia.org/wiki/EU_policy/Consultation_on_the_White_Paper_on_Artificial_Intelligence_(2020)
Federico
Like others, I see several problems:
1. If the code is public, someone can duplicate it and bypass our internal 'safekeeping', because it uses public data.
2. Risk of misuse, through either incompetence or malice.
3. Risk of accidentally exposing legitimate sockpuppets, even in the most closed-off situations.
4. How to give people insight into how the AI works.
My answers to those:
1. I have no problem with keeping this code in a private repo (yet technically open source). We also run private mailing lists and have private repos for configuration secrets. Yes, it is a bit of a stretch, but... IAR. At the same time, from the description, this seems like something any AI developer with a bit of determination could reproduce... so for how long will this matter?
2. NDA + OAuth access for those who need it. Aggressive action logging of usage of the software, and showing these logs to all users of the tool to enforce social control: "User X investigated the matches for account Y", "User Z investigated a match on previously known sockpuppet BlockedQ".
3. Usage-wise, I'd have two flows (a rough sketch follows below):
   1. Matches: surface 'matches' against previously known sockpuppets (this requires keeping track of that list), and only disclose the details of a match upon an additional, logged user action.
   2. Requests: enter specific account name(s) and ask whether there are matches on/between those name(s) (logged). Those flows might have different levels of match certainty, perhaps. If you want to go even further: require sign-off on a request by another user before you can actually view the matches.
4. That does leave you with the problem of how to give people insight into why the AI matched something. That is a hard problem, and I don't know enough about that problem space.
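As a rough illustration of points 2 and 3 (every request logged when filed, results only computed and revealed after a second user signs off), something along these lines; the names and the in-memory storage are made up for the example.

    # Sketch of the "request, then sign-off" flow: every comparison request is
    # logged when filed, but its result is only computed and revealed after a
    # second user approves it. Names and in-memory storage are made up.
    import datetime

    class MatchRequestQueue:
        def __init__(self):
            self.requests = []  # doubles as the publicly visible audit log

        def file_request(self, requester, accounts, reason):
            entry = {"id": len(self.requests), "requester": requester,
                     "accounts": accounts, "reason": reason,
                     "filed": datetime.datetime.utcnow().isoformat(),
                     "approved_by": None, "result": None}
            self.requests.append(entry)
            return entry["id"]

        def approve(self, request_id, approver, run_comparison):
            entry = self.requests[request_id]
            if approver == entry["requester"]:
                raise PermissionError("Sign-off must come from a second user.")
            entry["approved_by"] = approver
            entry["result"] = run_comparison(entry["accounts"])  # only run now
            return entry["result"]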
DJ
For better or worse, it seems clear that the cat is out of the bag. Identity detection through stylometry is now an established technology, and you can easily find code on GitHub or elsewhere (e.g. https://github.com/jabraunlin/reddit-user-id) to accomplish it (if you have the time and energy to build a data set and train the model). Back in 2017 there was even a start-up offering this as a service. Whatever danger is embodied in Amir's code, it's only a matter of time before that danger is ubiquitous. As for the worst-case scenario – governments using the technology to hunt down dissidents – I imagine this is already happening. So while I agree there is a moral consideration in releasing this software, I don't think the moral implications are actually that huge. Eventually we will just have to accept that creating separate accounts is not an effective way to protect your identity.

That said, I think taking precautions to minimize (or at least slow down) potential abuse of this technology is sensible. TheDJ offered many good suggestions in this vein, so I won't repeat them here. Overall, though, I think moving ahead with this tool is a good idea, and I hope you are able to come to a solution that is amenable to everyone. The WMF is also interested in this technology (as a potential mitigation for IP masking), so the outcome may help inform their work as well.
Thank you all for the responses; let me try to summarize my replies here.
* By closed source, I don't mean it will be accessible only to me. It's already accessible to another CU and one WMF staff member, and I would gladly share the code with anyone who has signed an NDA; they are of course more than welcome to change it. GitHub has a really low limit on the number of people who can access a private repo, but I would be fine with any means of fixing this.
* I have read people saying that there are already public tools to analyze text. I disagree: the tools you mentioned are for English and not other languages (maybe I missed something), and even if we imagine such tools exist for big languages like German and/or French, they don't cover lots of languages, unlike my tool, which is basically language-agnostic and depends only on the volume of discussion that has happened on the wiki.
* I also disagree that it's not hard to build. I have lots of experience with NLP (my favorite work being a tool that finds swear words in every language based on the history of vandalism in that Wikipedia [1]), and it still took me more than a year (a couple of hours almost every weekend) to build this. Analyzing pure, clean text is not hard; cleaning up wikitext, templates and links to get only the text people actually "spoke" is doubly hard, and analyzing user signatures brings only suffering and sorrow.
* While in general I agree that if a government wants to build this, it can, reality is more complicated, and this situation is similar to security. You can never be 100% secure, but you can increase the cost of hacking you so much that it becomes pointless for a major actor to do it. Governments have limited budgets, dictatorships are by design corrupt and filled with incompetent people [2], and sanctions put another restraint on such governments. So I would not hand them such an opportunity for oppression on a silver platter, for free; if they really want it, they must pay for it (which means they can't use that money and those resources to oppress some other group).
* People have said this AI is easy to game. It's not that easy, the tools you mentioned are limited to English, and it's still a big win for the integrity of our projects. It boils down again to increasing the cost. If a major actor wants to spread disinformation, so far they only need to fake their UA and IP, which is a piece of cake, and I already see that (as a CU); but now they would have to mess with UA/IP AND change the way they speak (which is an order of magnitude harder than changing an IP). As I said, increasing this cost might not prevent it from happening, but at least it takes away some of their ability to oppress other groups.
* This tool will never be the only reason to block a sock. It's more than anything a helper: if a CU brings up a large range and the accounts are similar but the result is not conclusive, this tool can help. Or when we are 90% sure it's a WP:DUCK, this tool can help too. But blocking just because this tool said so would imply a "Minority Report" situation, and to be honest I would really like to avoid that. It is supposed to empower CUs.
* Banning the use of this tool is not legally possible: the content of Wikipedia is published under CC BY-SA, which allows such analysis, and you especially can't ban an off-wiki action. Also, if a university professor can do it, I don't see the point of banning its use by the most trusted group of users (CUs). You can ban blocking based on this tool, but I don't think we should block solely based on it anyway.
* It has been pointed out by people on the checkuser mailing list that there's no point in logging access to this tool: since the code is accessible to CUs (if they want it), they can download it and run it on their own computers without any logging anyway.
* There is a huge difference between CU and this AI tool in matters of privacy. While both are privacy-sensitive, CU reveals much more. As a CU, I know where lots of people live or study because they showed up in my checks, and while I won't tell a soul about them, it makes me uncomfortable (I'm not implying CUs are not trusted; it's just that we should respect people's privacy and avoid "unreasonable search and seizure" [3]). This tool only reveals a connection between accounts, which matters if one of them is linked to a public identity and the other is not; I wholeheartedly agree that is not great, but it's not on the same level as seeing people's IPs. I even think that in an ideal world where the AI model were more accurate than CU, we should stop using CU and rely solely on the AI instead (important: I'm not implying the current model is better; I'm saying if it were better). This also helps explain why, for example, fishing for sock puppets with CU is bad (and banned by policy) while fishing for socks using this AI is not bad and can be a good starting point. In other words, this tool, used right, can reduce checkuser actions and protect people's privacy instead.
* People have been saying you need to teach people about AI so that, for example, CUs don't make wrong judgments based on this. I want to point out that the examples mentioned in the discussion are supervised machine learning, which is AI, but not all of AI. This tool is not machine learning, but it is AI (relying heavily on NLP): it produces graphs and the like, and it doesn't give a number like "95% sure these two users are the same" the way a supervised machine learning model would. I think reducing people's fingerprints to a single number is inaccurate and harmful (life is not like a TV crime series where a forensic scientist hands you the truth using some magic). I will write detailed instructions on how to use it, but it's not as bad as you'd think; I leave a huge amount of room for human judgment.
[1] Have fun (warning: explicit language): https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-enwiki [2] To understand why, you can read the political science book "The Dictator's Handbook": https://en.wikipedia.org/wiki/The_Dictator%27s_Handbook [3] From the Fourth Amendment to the US Constitution; you can find a similar clause in every constitution.
Hope this responds to some concerns. Sorry for a long email.
Please stop calling this an "AI" system; it is not. It is statistical learning.
This is probably not going to make me popular…
In some jurisdictions you need a permit to create, manage, and store biometric identifiers, no matter whether the biometric identifier is for a known person or not. If you want to create biometric identifiers and use them, make darn sure you follow every applicable law and rule. I'm not amused by the idea of having CUs use illegal tools to vet ordinary users.
Any system that tries to remove the anonymity of users on Wikipedia should have an RfC where the community can make their concerns heard. This is not the proper forum to get acceptance from the Wikipedia community.
And by the way, systems for cleaning up prose exist for a whole bunch of languages, not only English. Grammarly is one, LanguageTool another, and there are plenty of other such tools.
lør. 8. aug. 2020, 19.42 skrev Amir Sarabadani ladsgroup@gmail.com:
Thank you all for the responses, I try to summarize my responses here.
- By closed source, I don't mean it will be only accessible to me, It's
already accessible by another CU and one WMF staff, and I would gladly share the code with anyone who has signed NDA and they are of course more than welcome to change it. Github has a really low limit for people who can access a private repo but I would be fine with any means to fix this.
- I have read that people say that there are already public tools to
analyze text. I disagree, 1- The tools you mentioned are for English and not other languages (maybe I missed something) and even if we imagine there would be such tools for big languages like German and/or French, they don't cover lots of languages unlike my tool that's basically language agnostic and depends on the volume of discussions happened in the wiki.
- I also disagree that it's not hard to build. I have lots of experience
with NLP (with my favorite work being a tool that finds swear words in every language based on history of vandalism in that Wikipedia [1]) and still it took me more than a year (a couple of hours almost in every weekend) to build this, analyzing a pure clean text is not hard, cleaning up wikitext and templates and links to get only text people "spoke" is doubly hard, analyzing user signatures brings only suffer and sorrow.
- While in general I agree if a government wants to build this, they can
but reality is more complicated and this situation is similar to security. You can never be 100% secure but you can increase the cost of hacking you so much that it would be pointless for a major actor to do it. Governments have a limited budget and dictatorships are by design corrupt and filled with incompotent people [2] and sanctions put another restrain on such governments too so I would not give them such opportunity for oppersion in a silver plate for free, if they really want to, then they must pay for it (which means they can't use that money/resources on oppersing some other groups).
- People have said this AI is easy to be gamed, while it's not that easy
and the tools you mentioned are limited to English, it's still a big win for the integrity of our projects. It boils down again to increasing the cost. If a major actor wants to spread disinformation, so far they only need to fake their UA and IP which is a piece of cake and I already see that (as a CU) but now they have to mess with UA/IP AND change their methods of speaking (which is one order of magnitude harder than changing IP). As I said, increasing this cost might not prevent it from happening but at least it takes away the ability of oppressing other groups.
- This tool never will be the only reason to block a sock. It's more than
anything a helper, if CU brings a large range and they are similar but the result is not conclusive, this tool can help. Or when we are 90% sure it's a WP:DUCK, this tool can help too but blocking just because this tool said so would imply a "Minority report" situation and to be honest and I would really like to avoid that. It is supposed to empower CUs.
- Banning using this tool is not possible legally, the content of Wikipedia
is published under CC-BY-SA and this allows such analysis specially you can't ban an offwiki action. Also, if a university professor can do it, I don't see the point of banning using it by the most trusted group of users (CUs). You can ban blocking based on this tool but I don't think we should block solely based on this anyway.
- It has been pointed out by people in the checkuser mailing list that
there's no point in logging accessing this tool, since the code is accessible to CUs (if they want to), so they can download and run it on their computer without logging anyway.
- There is a huge difference between CU and this AI tool when it comes to privacy. Both are privacy-sensitive, but CU reveals much more: as a CU, I know where lots of people live or study because they showed up in my checks, and while I won't tell a soul about them, it makes me uncomfortable (I'm not implying CUs are not trusted; it's just that we should respect people's privacy and avoid "unreasonable search and seizure" [3]). This tool, by contrast, only reveals a connection between accounts, which is a problem if one of them is linked to a public identity and the other is not; I wholeheartedly agree that is not great, but it's not on the same level as seeing people's IPs. I even think that in an ideal world where the AI model were more accurate than CU, we should stop using CU and rely solely on the AI instead (important: I'm not saying the current model is better, I'm saying if it were better). This helps explain why, for example, fishing for sock puppets with CU is bad (and banned by policy), while fishing for socks with this AI is not bad and can be a good starting point. In other words, this tool, used right, can reduce checkuser actions and protect people's privacy instead.
- People have been saying that you need to teach AI to people so that, for example, CUs don't make wrong judgments based on this tool. I want to point out that the examples mentioned in the discussion are supervised machine learning, which is AI but not all of AI. This tool is not machine learning, but it is AI (it relies heavily on NLP): it produces graphs and similar output, and it wouldn't give a number like "95% sure these two users are the same" the way a supervised machine learning model would (see the plotting sketch at the end of this list). I think reducing people's linguistic fingerprints to a single number is inaccurate and harmful (life is not like a TV crime series where a forensic scientist gives you the truth using some magic). I will write detailed instructions on how to use it, but it's not as bad as you'd think; I leave plenty of room for human judgment.
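To illustrate the cleanup problem mentioned above, here is a minimal sketch of the kind of preprocessing involved. It is not the tool's actual code; it assumes the mwparserfromhell library and uses a deliberately naive regex for signature timestamps, which real discussions would quickly defeat:

    import re
    import mwparserfromhell

    # Naive pattern for enwiki-style signature timestamps, e.g.
    # "12:34, 8 August 2020 (UTC)". Real signatures (and other wikis'
    # date formats) need far more careful handling than this.
    TIMESTAMP_RE = re.compile(r"\d{2}:\d{2}, \d{1,2} \w+ \d{4} \(UTC\)")

    def extract_spoken_text(wikitext: str) -> str:
        """Strip templates, links and markup from a talk page comment,
        keeping only the prose the user actually wrote."""
        code = mwparserfromhell.parse(wikitext)
        text = code.strip_code(normalize=True, collapse=True)
        text = TIMESTAMP_RE.sub("", text)   # drop signature timestamps
        return " ".join(text.split())       # normalize whitespace

    print(extract_spoken_text(
        "I {{ping|Example}} agree with [[WP:DUCK|this]]. "
        "[[User:Example2|Example2]] 12:34, 8 August 2020 (UTC)"
    ))
    # -> "I agree with this. Example2" (the username would also need stripping)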
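To make the "graphs, not a single number" point concrete, here is another rough sketch (again not the tool's actual code) that plots the word-frequency distributions of two users side by side instead of collapsing them into one similarity score; whitespace tokenization keeps it roughly language-agnostic:

    from collections import Counter
    import matplotlib.pyplot as plt

    def word_distribution(comments, top_n=30):
        """Relative frequency of the most common tokens in a user's comments.
        Splitting on whitespace avoids language-specific assumptions."""
        counts = Counter(tok.lower() for text in comments for tok in text.split())
        total = sum(counts.values())
        return {word: n / total for word, n in counts.most_common(top_n)}

    def plot_comparison(dist_a, dist_b, label_a, label_b):
        """Overlay two users' distributions on the union of their top words,
        leaving the judgment of how similar is 'too similar' to a human."""
        words = sorted(set(dist_a) | set(dist_b))
        xs = range(len(words))
        plt.figure(figsize=(12, 4))
        plt.plot(xs, [dist_a.get(w, 0) for w in words], marker="o", label=label_a)
        plt.plot(xs, [dist_b.get(w, 0) for w in words], marker="x", label=label_b)
        plt.xticks(xs, words, rotation=90)
        plt.ylabel("relative frequency")
        plt.legend()
        plt.tight_layout()
        plt.show()

    plot_comparison(
        word_distribution(["an example comment", "another comment by user A"]),
        word_distribution(["a different user entirely", "with another style"]),
        "User A", "User B",
    )

The point is that the output is a picture a human has to interpret, not a verdict.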
[1] Have fun (warning, explicit language): https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-enwiki
[2] To understand why, see "The Dictator's Handbook", a book on political science: https://en.wikipedia.org/wiki/The_Dictator%27s_Handbook
[3] From the Fourth Amendment of the US Constitution; you can find a similar clause in nearly every constitution.
Hope this responds to some concerns. Sorry for a long email.
On Sat, Aug 8, 2020 at 9:44 PM John Erling Blad jeblad@gmail.com wrote:
Please stop calling this an “AI” system, it is not. It is statistical learning.
So in other words, it is an AI system? AI is just a colloquial synonym for statistical learning at this point.
-- Brian
For my part, I think Amir is going way above and beyond to be so thoughtful and open about the future of his tool.
I don't see how any part of it constitutes creating biometric identifiers, nor is it obvious to me how it must remove anonymity of users.
John, perhaps you can elaborate on your reasoning there?
Ultimately I don't think community approval for this tool is technically required. I appreciate the effort to solicit input and don't think it would hurt to do that more broadly.
On Sat, Aug 8, 2020 at 5:44 PM John Erling Blad jeblad@gmail.com wrote:
Please stop calling this an “AI” system, it is not. It is statistical learning.
This is probably not going to make me popular…
In some jurisdictions you will need a permit to create, manage, and store biometric identifiers, no matter whether the biometric identifier is for a known person or not. If you want to create biometric identifiers and use them, make darn sure you follow every applicable law and rule. I'm not amused by the idea of having CUs use illegal tools to vet ordinary users.
Any system that tries to remove the anonymity of users on Wikipedia should have an RfC where the community can make their concerns heard. This is not the proper forum to get acceptance from Wikipedia's community.
And btw, systems for cleanup of prose exist for a whole bunch of languages, not only English. Grammarly is one, LanguageTool another, and there are a whole bunch of other such tools.
On Sun, Aug 9, 2020 at 2:18 AM Nathan nawrich@gmail.com wrote:
I don't see how any part of it constitutes creating biometric identifiers, nor is it obvious to me how it must remove anonymity of users.
The GDPR, for example, defines biometric data as "personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person" (Art. 4(14)). That seems to fit, although it could be argued that the tool would link accounts to other accounts and not to people, so the data is not used for the "identification of a natural person"; that argument does not sound super convincing, though. The GDPR (Art. 9) generally forbids processing biometric data, except for a number of special cases, some of which can be argued to apply:
* processing is carried out in the course of its legitimate activities with appropriate safeguards by a foundation, association or any other not-for-profit body with a political, philosophical, religious or trade union aim and on condition that the processing relates solely to the members or to former members of the body or to persons who have regular contact with it in connection with its purposes and that the personal data are not disclosed outside that body without the consent of the data subjects;
* processing relates to personal data which are manifestly made public by the data subject;
but I wouldn't say it's clear-cut.
FWIW, the movement strategy included a recommendation* for having a technology ethics review process [1]; maybe this is a good opportunity to experiment with creating a precursory, unofficial version of that - make a wiki page for the sock puppet detection tool, and a proposal process for such pages, and consider where we could source expert advice from.
* More precisely, it was a draft recommendation. The final recommendations were significantly less fine-grained. [1] https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2018-20/Recommen...
On Sat, Aug 8, 2020 at 7:43 PM Amir Sarabadani ladsgroup@gmail.com wrote:
- By closed source, I don't mean it will be only accessible to me. It's already accessible to another CU and one WMF staff member, and I would gladly share the code with anyone who has signed the NDA; they are of course more than welcome to change it. GitHub has a really low limit on how many people can access a private repo, but I would be fine with any means of fixing that.
Closed source is commonly understood to mean the code is not under an OSI-approved open-source license (such code is banned from Toolforge). Contrary to common misconceptions, many OSI-approved open-source licenses (such as the GPL) allow keeping the code private, as long as the software itself is also kept private. IMO it would be less confusing to use the "public"/"private" terminology here - yes the code should be open-sourced, but that's mostly orthogonal to the concerns discussed here.
* It has been pointed out by people on the checkuser mailing list that there's no point in logging access to this tool, since the code is available to CUs (if they want it), so they can download it and run it on their own computers without any logging anyway.
There's a significant difference between your actions not being logged vs. your actions being logged unless you actively circumvent the logging (in ways which would probably seem malicious). Clear red lines work well in a community project even when there's nothing physically stopping people from stepping over them.
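As a purely hypothetical illustration of such a red line (none of these names refer to an existing tool or API), a web front end could refuse to run a comparison until the CU supplies a reason, which gets written to an audit log first:

    import datetime
    import json

    AUDIT_LOG = "comparison_audit.log"  # hypothetical location

    def log_and_compare(checkuser, account_a, account_b, reason, compare_fn):
        """Append an audit record before running the comparison; refuse to run
        without a meaningful reason. compare_fn stands in for whatever backend
        does the actual stylometric comparison."""
        if len(reason.strip()) < 10:
            raise ValueError("A meaningful reason is required for every check.")
        record = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "checkuser": checkuser,
            "accounts": [account_a, account_b],
            "reason": reason.strip(),
        }
        with open(AUDIT_LOG, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
        return compare_fn(account_a, account_b)

Nothing physically stops someone from running the backend locally and bypassing this, which is exactly the point: the value is in the clear, logged default path, not in making circumvention impossible.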
* There is a huge difference between CU and this AI tool when it comes to privacy. Both are privacy-sensitive, but CU reveals much more: as a CU, I know where lots of people live or study because they showed up in my checks (...) This tool only reveals a connection between accounts, which is a problem if one of them is linked to a public identity and the other is not; I wholeheartedly agree that is not great, but it's not on the same level as seeing people's IPs.
On the other hand, IP checks are very unreliable. A hypothetical tool that is reliable would be a bigger privacy concern, since it would be used more often and more successfully to extract private details. (On the other other hand, as a Wikipedia editor I have a reasonable expectation of privacy of the site not telling its administrators where I live. Do I have a reasonable expectation of privacy for not telling them what my alt accounts are? Arguably not.)
Also, how much help would such a tool be in off-wiki stylometry? If it can be used (on its own or with additional tooling) to connect wiki accounts to other online accounts, that would subjectively seem to me to have a significantly larger privacy impact than IP addresses.
On Fri, Aug 7, 2020 at 6:39 PM Ryan Kaldari rkaldari@wikimedia.org wrote:
Whatever danger is embodied in Amir's code, it's only a matter of time before this danger is ubiquitous. And for the worst-case scenario—governments using the technology to hunt down dissidents—I imagine this is already happening. So while I agree there is a moral consideration to releasing this software, I think the moral implications aren't actually that huge. Eventually, we will just have to accept that creating separate accounts is not an effective way to protect your identity.
Deanonymizing wiki accounts is one way of misusing the tool, and one which would indeed happen anyway. Another scenario is an attacker examining the tool with the intent of misleading it (such as using an adversarial network to construct edits which the tool would consistently misidentify as belonging to a certain user, which could be used to cast suspicion on a legitimate user). That specifically depends on the model being publicly available.