Please stop calling this an “AI” system; it is not. It is statistical
learning.
This is probably not going to make me popular…
In some jurisdictions you will need a permit to create, manage, and store
biometric identifiers, regardless of whether the biometric identifier is
for a known person or not. If you want to create biometric identifiers and
use them, make darn sure you follow every applicable law and rule. I'm not
amused by the idea of having CUs using illegal tools to vet ordinary users.
Any system that tries to remove anonymity of users on Wikipedia should have
an RfC where the community can make their concerns heard. This is not the
proper forum to get acceptance from Wikipedia's community.
And by the way, systems for cleanup of prose exist for a whole bunch of
languages, not only English. Grammarly is one, LanguageTool another, and
there are a whole bunch of other such tools.
Sat, 8 Aug 2020, 19:42 Amir Sarabadani <ladsgroup(a)gmail.com> wrote:
Thank you all for the responses; I'll try to summarize my
responses here.
* By closed source, I don't mean it will be accessible only to me. It's
already accessible to another CU and one WMF staff member, and I would
gladly share the code with anyone who has signed the NDA; they are of
course more than welcome to change it. GitHub has a really low limit on how
many people can access a private repo, but I would be fine with any means
to fix this.
* I have read that people say there are already public tools to analyze
text. I disagree: 1) the tools you mentioned are for English and not other
languages (maybe I missed something), and even if we imagine there were
such tools for big languages like German and/or French, they don't cover
lots of languages, unlike my tool, which is basically language agnostic and
depends only on the volume of discussions that have happened on the wiki.
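A language-agnostic analysis of the kind described above can be sketched with character n-gram profiles, which need no dictionaries or parsers for a specific language. This is purely my illustration of the general technique, not the tool's actual code; the function names and the overlap measure are assumptions of mine.

```python
from collections import Counter


def char_ngram_profile(text, n=3, top=100):
    """Return the `top` most frequent character n-grams of a text.

    Character n-grams capture spelling and phrasing habits and work
    for any script, which is what makes this approach language
    agnostic.
    """
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(grams.most_common(top))


def profile_overlap(p1, p2):
    """Fraction of top n-grams the two profiles share (0.0 to 1.0)."""
    shared = set(p1) & set(p2)
    return len(shared) / max(len(p1), len(p2), 1)


# Two German sentences with similar wording produce a high overlap.
a = char_ngram_profile("Das ist meiner Meinung nach keine gute Idee.")
b = char_ngram_profile("Meiner Meinung nach ist das eine gute Idee.")
print(profile_overlap(a, b))
```

The same functions run unchanged on Persian, German, or English input; only the volume of text per user matters.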
* I also disagree that it's not hard to build. I have lots of experience
with NLP (my favorite work being a tool that finds swear words in every
language based on the history of vandalism in that Wikipedia [1]), and it
still took me more than a year (a couple of hours almost every weekend) to
build this. Analyzing pure clean text is not hard; cleaning up wikitext,
templates, and links to get only the text people "spoke" is doubly hard,
and analyzing user signatures brings only suffering and sorrow.
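To give a feel for the wikitext-cleanup problem mentioned above, here is a deliberately minimal sketch of my own (not the tool's code) that strips templates, links, and some markup from a talk-page comment. The real job is far messier: nested tables, HTML, and per-wiki signature formats all need handling.

```python
import re


def strip_wikitext(text):
    """Very rough reduction of talk-page wikitext to plain prose.

    A simplified sketch only; real cleanup of discussion pages is
    much more involved.
    """
    # Remove templates, innermost first, so nesting unwinds.
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop bold/italic quotes and leading indent/list markers.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"^[:*#]+\s*", "", text, flags=re.MULTILINE)
    return text.strip()


print(strip_wikitext("::{{ping|Example}} I [[WP:AGF|assume good faith]] here."))
# -> I assume good faith here.
```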
* While in general I agree that if a government wants to build this, it
can, reality is more complicated, and this situation is similar to
security. You can never be 100% secure, but you can increase the cost of
hacking you so much that it becomes pointless for a major actor to do it.
Governments have a limited budget, dictatorships are by design corrupt and
filled with incompetent people [2], and sanctions put another restraint on
such governments, so I would not hand them such an opportunity for
oppression on a silver platter for free. If they really want it, then they
must pay for it (which means they can't use that money/those resources on
oppressing some other group).
* People have said this AI is easy to game. While it's not that easy, and
the tools you mentioned are limited to English, it's still a big win for
the integrity of our projects. It boils down again to increasing the cost.
If a major actor wants to spread disinformation, so far they only need to
fake their UA and IP, which is a piece of cake, and I already see that (as
a CU); but now they would have to mess with UA/IP AND change their manner
of speaking (which is an order of magnitude harder than changing an IP). As
I said, increasing this cost might not prevent it from happening, but at
least it takes away some of the ability to oppress other groups.
* This tool will never be the only reason to block a sock. It's more than
anything a helper: if a CU check brings up a large range and the accounts
are similar but the result is not conclusive, this tool can help. Or when
we are 90% sure it's a WP:DUCK, this tool can help too. But blocking just
because this tool said so would imply a "Minority Report" situation, and to
be honest, I would really like to avoid that. It is supposed to empower
CUs.
* Banning the use of this tool is not legally possible. The content of
Wikipedia is published under CC-BY-SA, which allows such analysis, and you
especially can't ban an off-wiki action. Also, if a university professor
can do it, I don't see the point of banning its use by the most trusted
group of users (CUs). You can ban blocking based on this tool, but I don't
think we should block solely based on it anyway.
* It has been pointed out by people on the checkuser mailing list that
there's no point in logging access to this tool, since the code is
accessible to CUs (if they want it), so they can download and run it on
their own computers without logging anyway.
* There is a huge difference between CU and this AI tool in matters of
privacy. While both are privacy sensitive, CU reveals much more. As a CU, I
know where lots of people live or study because they showed up in my CUs,
and while I won't tell a soul about them, it makes me uncomfortable (I'm
also not implying CUs are not trusted; it's just that we should respect
people's privacy and avoid "unreasonable search and seizure" [3]). This
tool only reveals a connection between accounts if one of them is linked to
a public identity and the other is not, which I wholeheartedly agree is not
great, but it's not on the same level as seeing people's IPs. So I even
think that in an ideal world where the AI model is more accurate than CU,
we should stop using CU and rely solely on the AI instead (important: I'm
not implying the current model is better; I'm saying if it were better).
This would help us understand why, for example, fishing for sock puppets
with CU is bad (and banned by the policy) but fishing for socks using this
AI is not bad and can be a good starting point. In other words, this tool,
used right, can reduce checkuser actions and protect people's privacy
instead.
* People have been saying you need to teach AI concepts to people so that,
for example, CUs don't make wrong judgments based on this. I want to point
out that the examples mentioned in the discussion are supervised machine
learning, which is AI, but not all of AI. This tool is not machine
learning, but it is AI (by heavily relying on NLP); for example, it
produces graphs, etc., and it wouldn't give a number like "95% sure these
two users are the same" the way a supervised machine learning model would.
I think reducing people's fingerprints to just a number is inaccurate and
harmful (life is not like a TV crime series where a forensic scientist
gives you the truth using some magic). I will write detailed instructions
on how to use it, but it's not as bad as you'd think; it leaves huge room
for human judgment.
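To illustrate the difference described above between a single verdict number and a graph that leaves room for human judgment, here is a hedged sketch of mine (not the tool's actual method): it emits pairwise similarity edges between accounts rather than one "same user" probability, and the reviewer decides what an edge means. The word-vector features and the threshold are my own assumptions for the example.

```python
from collections import Counter
from math import sqrt


def word_vector(text):
    """Bag-of-words count vector; a real tool would normalize more."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def similarity_edges(comments_by_user, threshold=0.3):
    """Return (user_a, user_b, similarity) edges above a threshold.

    The output is a graph to inspect, not a verdict: a human reviewer
    decides whether an edge plus other evidence means anything.
    """
    vecs = {u: word_vector(" ".join(c)) for u, c in comments_by_user.items()}
    users = sorted(vecs)
    edges = []
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            s = cosine(vecs[a], vecs[b])
            if s >= threshold:
                edges.append((a, b, round(s, 2)))
    return edges
```

A reviewer would feed this the cleaned-up talk-page comments per account and look at the resulting edges alongside behavioral evidence, rather than treating any single number as proof.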
[1] Have fun (warning, explicit language):
https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-enwiki
[2] To understand why, you can read this political science book, "The
Dictator's Handbook":
https://en.wikipedia.org/wiki/The_Dictator%27s_Handbook
[3] From the Fourth Amendment of the US Constitution; you can find a
similar clause in every constitution.
Hope this responds to some of the concerns. Sorry for the long email.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l