Please stop calling this an “AI” system; it is not. It is statistical
learning.
This is probably not going to make me popular…
In some jurisdictions you will need a permit to create, manage, and store
biometric identifiers, regardless of whether the biometric identifier is
for a known person or not. If you want to create biometric identifiers and
use them, make darn sure you follow every applicable law and rule. I'm not
amused by the idea of having CUs using illegal tools to vet ordinary users.
Any system that tries to remove anonymity of users on Wikipedia should have
an RfC where the community can make their concerns heard. This is not the
proper forum to get acceptance from Wikipedia's community.
And by the way, systems for cleanup of prose exist for a whole bunch of
languages, not only English. Grammarly is one, LanguageTool another, and
there are a whole bunch of other such tools.
Sat, 8 Aug 2020, 19:42 Amir Sarabadani <ladsgroup(a)gmail.com> wrote:
Thank you all for the responses; I'll try to summarize my
responses here.
* By closed source, I don't mean it will be accessible only to me. It's
already accessible to another CU and one WMF staff member, and I would
gladly share the code with anyone who has signed the NDA; they are of
course more than welcome to change it. GitHub has a really low limit on how
many people can access a private repo, but I would be fine with any means
to fix this.
* I have read that people say there are already public tools to analyze
text. I disagree: 1) the tools you mentioned are for English and not other
languages (maybe I missed something), and even if we imagine there were
such tools for big languages like German and/or French, they don't cover
lots of languages, unlike my tool, which is basically language agnostic and
depends only on the volume of discussions that have happened on the wiki.
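A language-agnostic analysis of the kind described above can be sketched with character n-gram profiles, which need no dictionaries or parsers for a specific language. This is purely my illustration of the general technique, not the tool's actual code; the function names and the overlap measure are assumptions of mine.

```python
from collections import Counter


def char_ngram_profile(text, n=3, top=100):
    """Return the `top` most frequent character n-grams of a text.

    Character n-grams capture spelling and phrasing habits and work
    for any script, which is what makes this approach language
    agnostic.
    """
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(grams.most_common(top))


def profile_overlap(p1, p2):
    """Fraction of top n-grams the two profiles share (0.0 to 1.0)."""
    shared = set(p1) & set(p2)
    return len(shared) / max(len(p1), len(p2), 1)


# Two German sentences with similar wording produce a high overlap.
a = char_ngram_profile("Das ist meiner Meinung nach keine gute Idee.")
b = char_ngram_profile("Meiner Meinung nach ist das eine gute Idee.")
print(profile_overlap(a, b))
```

The same functions run unchanged on Persian, German, or English input; only the volume of text per user matters.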
* I also disagree that it's not hard to build. I have lots of experience
with NLP (my favorite work being a tool that finds swear words in every
language based on the history of vandalism in that Wikipedia [1]), and it
still took me more than a year (a couple of hours almost every weekend) to
build this. Analyzing pure clean text is not hard; cleaning up wikitext,
templates, and links to get only the text people "spoke" is doubly hard,
and analyzing user signatures brings only suffering and sorrow.
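To give a feel for the wikitext-cleanup problem mentioned above, here is a deliberately minimal sketch of my own (not the tool's code) that strips templates, links, and some markup from a talk-page comment. The real job is far messier: nested tables, HTML, and per-wiki signature formats all need handling.

```python
import re


def strip_wikitext(text):
    """Very rough reduction of talk-page wikitext to plain prose.

    A simplified sketch only; real cleanup of discussion pages is
    much more involved.
    """
    # Remove templates, innermost first, so nesting unwinds.
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop bold/italic quotes and leading indent/list markers.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"^[:*#]+\s*", "", text, flags=re.MULTILINE)
    return text.strip()


print(strip_wikitext("::{{ping|Example}} I [[WP:AGF|assume good faith]] here."))
# -> I assume good faith here.
```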
* While in general I agree that if a government wants to build this, it
can, reality is more complicated, and this situation is similar to
security. You can never be 100% secure, but you can increase the cost of
hacking you so much that it becomes pointless for a major actor to do it.
Governments have a limited budget, dictatorships are by design corrupt and
filled with incompetent people [2], and sanctions put another restraint on
such governments, so I would not hand them such an opportunity for
oppression on a silver platter for free. If they really want it, then they
must pay for it (which means they can't use that money/those resources on
oppressing some other group).
* People have said this AI is easy to game. While it's not that easy, and
the tools you mentioned are limited to English, it's still a big win for
the integrity of our projects. It boils down again to increasing the cost.
If a major actor wants to spread disinformation, so far they only need to
fake their UA and IP, which is a piece of cake, and I already see that (as
a CU); but now they would have to mess with UA/IP AND change their manner
of speaking (which is an order of magnitude harder than changing an IP). As
I said, increasing this cost might not prevent it from happening, but at
least it takes away some of the ability to oppress other groups.
* This tool will never be the only reason to block a sock. It's more than
anything a helper: if a CU check brings up a large range and the accounts
are similar but the result is not conclusive, this tool can help. Or when
we are 90% sure it's a WP:DUCK, this tool can help too. But blocking just
because this tool said so would imply a "Minority Report" situation, and to
be honest, I would really like to avoid that. It is supposed to empower
CUs.
* Banning the use of this tool is not legally possible. The content of
Wikipedia is published under CC-BY-SA, which allows such analysis, and you
especially can't ban an off-wiki action. Also, if a university professor
can do it, I don't see the point of banning its use by the most trusted
group of users (CUs). You can ban blocking based on this tool, but I don't
think we should block solely based on it anyway.
* It has been pointed out by people on the checkuser mailing list that
there's no point in logging access to this tool, since the code is
accessible to CUs (if they want it), so they can download and run it on
their own computers without logging anyway.
* There is a huge difference between CU and this AI tool in matters of
privacy. While both are privacy sensitive, CU reveals much more. As a CU, I
know where lots of people live or study because they showed up in my CUs,
and while I won't tell a soul about them, it makes me uncomfortable (I'm
also not implying CUs are not trusted; it's just that we should respect
people's privacy and avoid "unreasonable search and seizure" [3]). This
tool only reveals a connection between accounts if one of them is linked to
a public identity and the other is not, which I wholeheartedly agree is not
great, but it's not on the same level as seeing people's IPs. So I even
think that in an ideal world where the AI model is more accurate than CU,
we should stop using CU and rely solely on the AI instead (important: I'm
not implying the current model is better; I'm saying if it were better).
This would help us understand why, for example, fishing for sock puppets
with CU is bad (and banned by the policy) but fishing for socks using this
AI is not bad and can be a good starting point. In other words, this tool,
used right, can reduce checkuser actions and protect people's privacy
instead.
* People have been saying you need to teach AI concepts to people so that,
for example, CUs don't make wrong judgments based on this. I want to point
out that the examples mentioned in the discussion are supervised machine
learning, which is AI, but not all of AI. This tool is not machine
learning, but it is AI (by heavily relying on NLP); for example, it
produces graphs, etc., and it wouldn't give a number like "95% sure these
two users are the same" the way a supervised machine learning model would.
I think reducing people's fingerprints to just a number is inaccurate and
harmful (life is not like a TV crime series where a forensic scientist
gives you the truth using some magic). I will write detailed instructions
on how to use it, but it's not as bad as you'd think; it leaves huge room
for human judgment.
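To illustrate the difference described above between a single verdict number and a graph that leaves room for human judgment, here is a hedged sketch of mine (not the tool's actual method): it emits pairwise similarity edges between accounts rather than one "same user" probability, and the reviewer decides what an edge means. The word-vector features and the threshold are my own assumptions for the example.

```python
from collections import Counter
from math import sqrt


def word_vector(text):
    """Bag-of-words count vector; a real tool would normalize more."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def similarity_edges(comments_by_user, threshold=0.3):
    """Return (user_a, user_b, similarity) edges above a threshold.

    The output is a graph to inspect, not a verdict: a human reviewer
    decides whether an edge plus other evidence means anything.
    """
    vecs = {u: word_vector(" ".join(c)) for u, c in comments_by_user.items()}
    users = sorted(vecs)
    edges = []
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            s = cosine(vecs[a], vecs[b])
            if s >= threshold:
                edges.append((a, b, round(s, 2)))
    return edges
```

A reviewer would feed this the cleaned-up talk-page comments per account and look at the resulting edges alongside behavioral evidence, rather than treating any single number as proof.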
[1] Have fun (warning, explicit language):
https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-enwiki
[2] To understand why, you can read this political science book, "The
Dictator's Handbook":
https://en.wikipedia.org/wiki/The_Dictator%27s_Handbook
[3] From the Fourth Amendment of the US Constitution; you can find a
similar clause in every constitution.
Hope this responds to some of the concerns. Sorry for the long email.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l