Neil Harris wrote:
Hi Tim;
I've already thought of this (see my recent E-mail on the Wikitech list -- for some reason, I can't find the lengthy E-mail I thought I'd sent earlier that I refer to there).
Fortunately, not much "real cleverness" is needed.
The basic idea is the one suggested by multiple posters on the list:
- an aggressive canonicalization process (which must still have the
transitivity requirement above)
- looking up candidates with matching canonical forms (up to some limit,
perhaps 20, to stop denial-of-service attacks)
- if the number of candidates exceeds the limit, treat the name as a spoof, to fail safe
- then a second pass to do the checking _much_ more carefully, without
any need for transitivity or over-compression
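As a rough illustration of the two-pass idea in the list above, here is a minimal Python sketch. It is not the are_confusable_strings() implementation mentioned in this thread; the tables, the function names (canonical, careful_check, is_probable_spoof) and the in-memory scan of existing names are all assumptions made for the example. A real first pass would presumably be an indexed lookup on the stored canonical form.

```python
# Hypothetical, deliberately tiny table for illustration only.
# First pass: aggressive many-to-one folding (transitive by construction).
FIRST_PASS_MAP = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "0": "o",
    "1": "l",
    "i": "l",
}

# Second pass: a pairwise relation with no transitivity requirement.
# "1"~"l" and "i"~"l" are listed, but "1"~"i" is not.
CONFUSABLE_PAIRS = {frozenset(pair) for pair in FIRST_PASS_MAP.items()}


def canonical(name):
    """Aggressive first-pass canonical form (assumed folding rules)."""
    return "".join(FIRST_PASS_MAP.get(ch, ch) for ch in name.casefold())


def careful_check(a, b):
    """Stand-in for the careful second pass: compare position by position."""
    a, b = a.casefold(), b.casefold()
    if len(a) != len(b):
        return False
    return all(
        x == y or frozenset((x, y)) in CONFUSABLE_PAIRS
        for x, y in zip(a, b)
    )


def is_probable_spoof(new_name, existing_names, check=careful_check, limit=20):
    """Two-pass test: cheap canonical match, then careful pairwise check."""
    key = canonical(new_name)
    candidates = []
    for name in existing_names:
        if canonical(name) == key:
            candidates.append(name)
            if len(candidates) > limit:
                # Too many candidates: fail safe and treat as a spoof.
                return True
    return any(check(new_name, candidate) for candidate in candidates)
```

With these toy tables, "b1ll" and "bill" collide in the first pass (both fold to "blll") but are cleared by the second pass, whereas "bi11" against "bill" is still flagged.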
I'd be happy to E-mail you an implementation in Python of the very simple but more careful second-pass code, as a function are_confusable_strings() that takes two Python strings as input, and returns a boolean value. This can then be called from the PHP pass.
Sure, email away.
If we do this, we should be able to make the first pass even more aggressive than it is currently, to catch more possible spoof candidates, whilst still eliminating false positives in the second pass, thus improving both the false-positive and false-negative rates to a fraction of their current levels.
Generally speaking, you can't tell whether a given pair of names is an attempted spoof just by comparing the strings. You need to know the motivation of the person who created it. On the one hand we have users who want to find the minimal variation of their given name or Internet nickname that isn't already taken, and on the other hand, we have trolls who want to find the minimal variation of an existing username that isn't disallowed by the software. Both kinds of user wish to evade the software restrictions, but one of them has a motivation that we will tolerate, and the other does not.
As Gregory suggested, one useful heuristic would be to look at the number of edits of the target user. Another one that I proposed on IRC yesterday is a length heuristic -- i.e. collisions of short usernames are more likely to be accidental than collisions of long ones.
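To make those two heuristics concrete, a sketch along the following lines might do. The function name, the thresholds (50 edits, 6 characters) and the choice to require both conditions are assumptions for illustration only, not anything that has been agreed or measured.

```python
def collision_looks_deliberate(target_edit_count, name_length,
                               min_edits=50, min_length=6):
    """Guess whether a collision with an existing name is a likely spoof.

    Both thresholds are hypothetical placeholders, not tuned values.
    """
    # A target with many edits is a more plausible impersonation victim.
    established_target = target_edit_count >= min_edits
    # Collisions between short names are more likely to be accidental.
    long_name = name_length >= min_length
    # Assumption: require both signals before treating it as deliberate.
    return established_target and long_name
```

A collision that fails this test would be let through or left to human judgement, rather than blocked outright.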
We should _not_ remove the cross-script pairs from the list, as there are still whole-script confusables, e.g. "caxap", "soccer" -- surprisingly, 3% of English dictionary words have matching Cyrillic spoofs, and 1% have Greek spoofs. However, the second pass should completely eliminate any problems caused by the transitivity in the first pass.
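To show why a mixed-script test alone does not catch whole-script confusables, here is a small Python sketch. The helper scripts_of() is hypothetical, and it approximates a character's script by the first word of its Unicode character name, which is only a rough stand-in for the real Script property (the standard library's unicodedata module does not expose that property).

```python
import unicodedata


def scripts_of(text):
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", "GREEK", ...) is used as a proxy for the script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unnamed characters yield "" and are skipped.
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


latin = "caxap"
cyrillic = "\u0441\u0430\u0445\u0430\u0440"  # renders like "caxap"
mixed = "c\u0430xap"                         # one Cyrillic letter in a Latin word

# The whole-script spoof uses a single script, so a mixed-script check
# passes it; only the mixed string would be caught that way.
single_script_spoof = len(scripts_of(cyrillic)) == 1
mixed_script_spoof = len(scripts_of(mixed)) > 1
```

Here both "caxap" and its Cyrillic lookalike are single-script strings, so the cross-script pairs in the table are what allow the two to be matched at all.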
We have to remove some of the cross-script pairs until the software is changed, to fix the spurious within-script conflations. I'm not going to make everyone suffer while we have our leisurely chat about possible long-term fixes.
There is a need for judgement, regardless of the software in use. Trolls will go on trolling regardless of what anti-spoofing restrictions we have in place. Our aim should be to minimise their impact, and heuristic systems with a high false positive rate do quite the opposite.
-- Tim Starling