Please stop calling this an “AI” system; it is not. It is statistical learning.
This is probably not going to make me popular…
In some jurisdictions you will need a permit to create, manage, and store biometric identifiers, regardless of whether the biometric identifier is for a known person or not. If you want to create biometric identifiers and use them, make darn sure you follow every applicable law and rule. I'm not amused by the idea of having CUs using illegal tools to vet ordinary users.
Any system that tries to remove the anonymity of users on Wikipedia should have an RfC where the community can make their concerns heard. This is not the proper forum to get acceptance from Wikipedia's community.
And btw, systems for cleanup of prose exist for a whole bunch of languages, not only English. Grammarly is one, LanguageTool another, and there are a whole bunch of other such tools.
On Sat, 8 Aug 2020 at 19:42, Amir Sarabadani ladsgroup@gmail.com wrote:
Thank you all for the responses. I'll try to summarize my replies here.
- By closed source, I don't mean it will only be accessible to me. It's already accessible to another CU and one WMF staff member, and I would gladly share the code with anyone who has signed an NDA; they are of course more than welcome to change it. GitHub has a really low limit on the number of people who can access a private repo, but I would be fine with any means of fixing this.
- I have read that people say there are already public tools to analyze text. I disagree: the tools you mentioned are for English and not other languages (maybe I missed something), and even if we imagine there were such tools for big languages like German and/or French, they don't cover lots of languages, unlike my tool, which is basically language agnostic and depends on the volume of discussions that have happened on the wiki.
- I also disagree that it's not hard to build. I have lots of experience with NLP (my favorite work being a tool that finds swear words in every language based on the history of vandalism in that Wikipedia [1]) and it still took me more than a year (a couple of hours almost every weekend) to build this. Analyzing pure, clean text is not hard; cleaning up wikitext, templates, and links to get only the text people "spoke" is doubly hard; analyzing user signatures brings only suffering and sorrow.
- While in general I agree that if a government wants to build this, they can, reality is more complicated, and this situation is similar to security. You can never be 100% secure, but you can increase the cost of hacking you so much that it would be pointless for a major actor to do it. Governments have limited budgets, dictatorships are by design corrupt and filled with incompetent people [2], and sanctions put another constraint on such governments too, so I would not hand them such an opportunity for oppression on a silver platter for free. If they really want it, they must pay for it (which means they can't use that money or those resources on oppressing some other groups).
- People have said this AI is easy to game. It's not that easy, and the tools you mentioned are limited to English; either way, it's still a big win for the integrity of our projects. It boils down again to increasing the cost. If a major actor wants to spread disinformation, so far they only need to fake their UA and IP, which is a piece of cake and which I already see (as a CU), but now they have to mess with UA/IP AND change their way of speaking (which is an order of magnitude harder than changing IP). As I said, increasing this cost might not prevent it from happening, but at least it takes away from their ability to oppress other groups.
- This tool will never be the only reason to block a sock. It's more than anything a helper: if CU brings up a large range and the accounts are similar but the result is not conclusive, this tool can help. Or when we are 90% sure it's a WP:DUCK, this tool can help too. But blocking just because this tool said so would imply a "Minority Report" situation and, to be honest, I would really like to avoid that. It is supposed to empower CUs.
- Banning the use of this tool is not legally possible. The content of Wikipedia is published under CC-BY-SA, which allows such analysis, and in particular you can't ban an off-wiki action. Also, if a university professor can do it, I don't see the point of banning its use by the most trusted group of users (CUs). You can ban blocking based on this tool, but I don't think we should block solely based on it anyway.
- It has been pointed out by people on the checkuser mailing list that there's no point in logging access to this tool: since the code is accessible to CUs (if they want it), they can download it and run it on their own computers without logging anyway.
- There is a huge difference between CU and this AI tool in matters of privacy. Both are privacy sensitive, but CU reveals much more. As a CU, I know where lots of people live or study because they showed up in my checks, and while I won't tell a soul about them, it makes me uncomfortable (I'm also not implying CUs are not trusted; it's just that we should respect people's privacy and avoid "unreasonable search and seizure" [3]). This tool, by contrast, only reveals a connection between accounts if one of them is linked to a public identity and the other is not, which I wholeheartedly agree is not great, but it's not on the same level as seeing people's IPs. So I even think that in an ideal world where the AI model is more accurate than CU, we should stop using CU and rely solely on the AI instead (important: I'm not implying the current model is better, I'm saying if it were better). This would help us understand why, for example, fishing for sock puppets with CU is bad (and banned by policy) but fishing for socks using this AI is not bad and can be a good starting point. In other words, this tool, used right, can reduce checkuser actions and protect people's privacy instead.
- People have been saying that you need to teach AI to people so that, for example, CUs don't make wrong judgments based on this. I want to point out that the examples mentioned in the discussion are supervised machine learning, which is AI but not all of AI. This tool is not machine learning, but it is AI (it relies heavily on NLP); for example, it produces graphs and so on, and it won't give a number like "95% sure these two users are the same", which a supervised machine learning model would do. I think reducing people's fingerprints to just a number is inaccurate and harmful (life is not like a TV crime series where a forensic scientist gives you the truth using some magic). I'll write detailed instructions on how to use it, but it's not as bad as you'd think; I leave a lot of room for human judgment.
[1] Have fun (warning, explicit language): https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-enwiki
[2] To learn why, you can read this political science book, "The Dictator's Handbook": https://en.wikipedia.org/wiki/The_Dictator%27s_Handbook
[3] From the Fourth Amendment to the US Constitution; you can find a similar clause in every constitution.
Hope this responds to some concerns. Sorry for a long email.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l