For those interested: the best solution, as far as I know, for this kind of
similarity detection is a Siamese network with RNNs in the first stage.
That implies you must extract fingerprints for all likely candidates
(users), and then some, to create a baseline. You cannot simply claim that
two users (the adversary and the postulated sock) are the same because they
have edited the same page. It is quite unlikely that a user will edit the
same page with a sock puppet once it is known that such a system is active.
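As a rough illustration of the shared-encoder ("Siamese") idea, here is a minimal sketch. A real system would use a learned RNN encoder on both branches; here a bag of character trigrams stands in for the encoder, and the texts are invented:

```python
from collections import Counter
from math import sqrt

def encode(text):
    """Shared encoder: both sides of the Siamese pair go through this
    same function. A character-trigram bag stands in for the RNN
    encoder; in practice the encoder's weights would be learned."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse trigram vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(text_a, text_b):
    # Applying the SAME encoder to both inputs is the "Siamese" part.
    return cosine(encode(text_a), encode(text_b))
```

Two samples by the same writer should score higher than samples by different writers; the fingerprinting question is whether that gap survives an adversary who changes their wording.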
On Thu, Aug 6, 2020 at 10:49 PM John Erling Blad <jeblad(a)gmail.com> wrote:
Nice idea! The first time I wrote about this being possible was back in
2008-ish.
The problem is quite trivial: you use some observable feature to
fingerprint an adversary. The adversary can then game the system if the
observable feature can somehow be changed or modified. To avoid this, the
observable features are usually chosen to be physical properties that can't
be easily changed.
In this case the features are words and/or relations between words, and
then the question is “Can the adversary change the choice of words?” Yes,
he can, because the choice of words is not an inherent physical property of
the user. In fact, there are several programs that help users express
themselves more fluently, and such systems will change the observable
features, i.e. the choice of words. The program will move the observable
features (the words) from one user-specific distribution to another, more
program-specific distribution. A priori you will observe the users to be
different, but with the program they will a posteriori be more similar.
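A toy illustration of that distribution shift, with a made-up word-substitution table standing in for the writing-aid program:

```python
from collections import Counter

def word_dist(text):
    """Relative word frequencies of a text."""
    words = text.lower().split()
    c = Counter(words)
    return {w: c[w] / len(words) for w in c}

def tv_distance(p, q):
    """Total variation distance between two word distributions (0..1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in keys)

# Hypothetical writing aid: rewrites user-specific word choices into
# one program-preferred synonym, as described above.
NORMALIZE = {"reckon": "think", "guess": "think",
             "perhaps": "maybe", "possibly": "maybe"}

def normalize(text):
    return " ".join(NORMALIZE.get(w, w) for w in text.lower().split())

user_a = "i reckon this is fine perhaps we keep it"
user_b = "i guess this is fine possibly we keep it"

before = tv_distance(word_dist(user_a), word_dist(user_b))
after = tv_distance(word_dist(normalize(user_a)), word_dist(normalize(user_b)))
assert after < before  # the tool makes the two users look more alike
```

The two users are distinguishable before normalization and indistinguishable after it, which is exactly the a priori / a posteriori shift described above.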
A real problem is your own poisoning of the training data. That happens
when you find some subject to be the same as your postulated one and then
feed that information back into your training data. If you don't do that,
your training data will start to rot, because humans change over time. It
is bad whichever way you do it.
Even more fun is an adversary who knows what you are doing and tries to
negate your detection algorithm, or even fool you into believing he is
someone else. It is, after all, nothing more than word counts and
statistics. What will you do when someone edits a Wikipedia page and your
system tells you “This revision is most likely written by Jimbo”?
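To make the "word counts and statistics" point concrete, here is a toy attribution sketch (all names and texts are invented, and this is not the system under discussion): score a revision against each candidate's past writing using add-one-smoothed word counts and pick the best match.

```python
from collections import Counter
from math import log

def attribute(revision, candidates):
    """Return the candidate whose past texts make the revision most
    likely, under a per-word model with add-one smoothing. Literally
    just word counts and statistics."""
    vocab = set(revision.lower().split())
    for texts in candidates.values():
        for t in texts:
            vocab |= set(t.lower().split())
    scores = {}
    for name, texts in candidates.items():
        counts = Counter(w for t in texts for w in t.lower().split())
        total = sum(counts.values())
        scores[name] = sum(
            log((counts[w] + 1) / (total + len(vocab)))
            for w in revision.lower().split()
        )
    return max(scores, key=scores.get)

# Invented candidates and writing samples, for illustration only.
candidates = {
    "jimbo": ["the encyclopedia anyone can edit",
              "assume good faith please"],
    "vandal": ["lol this page is garbage lol",
               "garbage garbage lol"],
}
print(attribute("please assume good faith", candidates))  # prints "jimbo"
```

An adversary who knows this is the model only has to shift their word counts toward someone else's to defeat it, which is the point being made above.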
Several such programs exist, and I'm a bit perplexed that they are not in
wider use among Wikipedia's editors. Some of them are quite aggressive and
can propose rather radical rewrites of the text. I use one of them; it is
not the best, but it still corrects me all the time.
I believe it would be better to create a system where users are internally
identified and externally authenticated. (The former is biometric
identification and must adhere to privacy laws.)
On Thu, Aug 6, 2020 at 4:33 AM Amir Sarabadani <ladsgroup(a)gmail.com>
wrote:
Hey,
I have an ethical question that I haven't been able to answer. I have been
asking around, but with no definite answer so far, so I'm asking a larger
audience in the hope of a solution.
For almost a year now, I have been developing an NLP-based AI system able
to catch sock puppets (two accounts pretending to be different people but
actually operated by the same person). It's based on the way they speak.
The way we speak is like a fingerprint: it's unique to us, and it's really
hard to forge or change on demand (unlike an IP or UA). As a result, if
you apply some basic AI techniques to Wikipedia discussions (which can be
really lengthy, trust me), the sock puppets shine through in the datasets.
Here's an example; I highly recommend looking at these graphs. I compared
two pairs of users: one pair that are not sock puppets, and one pair of
known socks (a user who got banned indefinitely but came back hidden under
another username). [1][2] These graphs are based on one of several aspects
of this AI system.
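As a rough analogue of such a comparison (not the actual system; the texts and the choice of metric are mine), one can correlate two users' log word frequencies over their shared vocabulary. A sock pair's distributions line up; an unrelated pair's do not:

```python
from collections import Counter
from math import log, sqrt

def log_freqs(text):
    """Log relative word frequencies of a text."""
    words = text.lower().split()
    c = Counter(words)
    return {w: log(c[w] / len(words)) for w in c}

def pair_score(text_a, text_b):
    """Pearson correlation of log word frequencies over the shared
    vocabulary -- a toy analogue of plotting one user's word
    distribution against another's."""
    fa, fb = log_freqs(text_a), log_freqs(text_b)
    shared = sorted(set(fa) & set(fb))
    if len(shared) < 2:
        return 0.0
    xs = [fa[w] for w in shared]
    ys = [fb[w] for w in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sqrt(sum((x - mx) ** 2 for x in xs))
    vy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0
```

A score near 1 means the two accounts overuse and underuse the same words in the same proportions; real systems use much richer features, but the intuition matches the graphs.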
I have talked about this with the WMF and other CUs, aiming to build a
tool that helps us understand and catch socks, especially the ones that
have enough resources to change their IP/UA regularly (like sock farms
and/or UPEs). Also, with the increase of mobile internet providers and the
horrible way they assign IPs to their users, this can come in really handy
in some SPI ("Sock puppet investigation") [3] cases.
The problem is that this tool, while built only on public information,
actually has the power to expose legitimate sock puppets: people who live
under oppressive governments and edit on sensitive topics. Disclosing such
connections between two accounts can cost people their lives.
So, this code is not going to be public, period. But we need to have this
code in Wikimedia Cloud Services so that people like CUs on other wikis
are able to use it as a web-based tool instead of me running it for them
on request. But the WMCS terms of use explicitly say code should never be
closed-source, and this is our principle. What should we do? Should I pay
a corporate cloud provider and put such important code and data there?
Should we amend the terms of use to carve out exceptions like this one?
The most plausible solution suggested so far (thanks Huji) is to have a
shell of the code that would be useless without its data, keep the code
that produces the data (out of dumps) closed (which is fine; running that
code is not too hard, even on enwiki), and update the data myself. This
might be doable (of which I'm only around 30% sure; it still might expose
too much), but it wouldn't cover future cases similar to mine, and I think
a more long-term solution is needed here. Also, it would reduce the bus
factor to 1, and maintenance would be complicated.
What should we do?
Thanks
[1]
https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_…
[2]
https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_…
[3]
https://en.wikipedia.org/wiki/Wikipedia:SPI
--
Amir (he/him)
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l