Re: [Wikitech-l] AntiSpoof issues

12 Nov 2006


Neil Harris wrote:
...
Hi Tim;
I've already thought of this (see my recent E-mail on the Wikitech list 
-- for some reason, I can't find the lengthy E-mail I thought I'd sent 
earlier that I refer to there).
Fortunately, not much "real cleverness" is needed.
The basic idea is the one suggested by multiple posters on the list:

  * an aggressive canonicalization process (which must still have the
    transitivity requirement above)
  * looking up candidates with matching canonical forms (up to some limit,
    perhaps 20, to stop denial-of-service attacks)
  * if #(candidates) > limit, treat as a spoof, to fail-safe
  * then a second pass to do the checking _much_ more carefully, without
    any need for transitivity or over-compression
I'd be happy to E-mail you an implementation in Python of the very 
simple but more careful second-pass code, as a function 
are_confusable_strings() that takes two Python strings as input, and 
returns a boolean value. This can then be called from the PHP pass.
Sure, email away.
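
The two-pass scheme described above (aggressive canonicalization, a capped candidate lookup, then a careful pairwise check) might be sketched in Python along these lines. This is only an illustration, not the code offered in the post: the character tables, the `SpoofChecker` class, and the body of `are_confusable_strings()` are placeholder assumptions. Only the function name, its two-strings-in, boolean-out shape, and the limit of 20 come from the discussion.

```python
from collections import defaultdict

LIMIT = 20  # "perhaps 20" in the post; the exact value is a tuning choice

# Hypothetical first-pass table: every look-alike folds to ONE representative,
# so "same canonical form" is automatically a transitive relation.
CANONICAL = {
    "\u0430": "a", "\u0441": "c", "\u0445": "x",  # Cyrillic а, с, х
    "\u0440": "p", "\u0435": "e", "\u043e": "o",  # Cyrillic р, е, о
    "0": "o", "1": "l", "i": "l",                 # aggressive Latin/digit folds
}

# Hypothetical second-pass table: explicit pairs only, no transitivity.
CONFUSABLE_PAIRS = {frozenset(p) for p in [
    ("l", "1"), ("l", "I"), ("0", "O"),
    ("a", "\u0430"), ("c", "\u0441"), ("x", "\u0445"),
    ("p", "\u0440"), ("e", "\u0435"), ("o", "\u043e"),
]}


def canonicalize(name):
    """First pass: lowercase, then fold each character to its representative."""
    return "".join(CANONICAL.get(ch, ch) for ch in name.lower())


def are_confusable_strings(a, b):
    """Second pass (placeholder logic): position-by-position pair check."""
    if len(a) != len(b):
        return False
    return all(x == y or frozenset((x, y)) in CONFUSABLE_PAIRS
               for x, y in zip(a, b))


class SpoofChecker:
    def __init__(self):
        self._by_canonical = defaultdict(list)

    def register(self, name):
        """Index an existing username under its canonical form."""
        self._by_canonical[canonicalize(name)].append(name)

    def is_spoof(self, new_name):
        candidates = self._by_canonical.get(canonicalize(new_name), [])
        if len(candidates) > LIMIT:
            return True  # too many candidates to check: fail safe
        return any(are_confusable_strings(new_name, existing)
                   for existing in candidates)
```

With these toy tables, "paypa1" is flagged against an existing "paypal", and the all-Cyrillic look-alike of "caxap" is flagged against the Latin one. "b0b" collides with "bob" in the first pass but is cleared by the second, because the pair table only lists "0" against capital "O".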
...
If we do this, we should be able to make the first pass even more 
aggressive than it is currently, to catch more possible spoof 
candidates, whilst still eliminating false positives in the second pass, 
thus improving both the false-positive and false-negative rates to a 
fraction of their current levels.
Generally speaking, you can't tell whether a given pair of names is an
attempted spoof just by comparing the strings. You need to know the
motivation of the person who created it. On the one hand we have users who
want to find the minimal variation of their given name or Internet nickname
that isn't already taken, and on the other hand, we have trolls who want to
find the minimal variation of an existing username that isn't disallowed by
the software. Both users wish to evade the software restrictions, but one of
them has a motivation that we will tolerate, and one of them does not.
As Gregory suggested, one useful heuristic would be to look at the number of
edits of the target user. Another one that I proposed on IRC yesterday is a
length heuristic -- i.e. collisions of short usernames are more likely to be
accidental than collisions of long ones.
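
As a rough illustration of how the two heuristics above could be combined, here is a minimal Python sketch. The function name, both thresholds, and the choice to require both conditions at once are assumptions made for the example; none of them were proposed in this thread.

```python
def collision_looks_deliberate(new_name, target_edit_count,
                               min_length=6, min_edits=50):
    """Guess whether a name collision is more likely a spoof than an accident.

    Assumed rule: a collision is treated as suspicious only when the name is
    long (short collisions are more likely accidental) AND the existing
    account is well established (a plausible impersonation target).
    Both thresholds are hypothetical and would need tuning on real data.
    """
    is_long_name = len(new_name) >= min_length
    target_is_established = target_edit_count >= min_edits
    return is_long_name and target_is_established
```

Under these assumed thresholds, a collision with a long-standing account under a long name would be flagged, while a three-letter collision, or a collision with an account that has never edited, would be let through for human judgement.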
...
We should _not_ remove the cross-script pairs from the list, as there 
are still whole-script confusables, e.g. "caxap", "soccer" -- 
surprisingly, 3% of English dictionary words have matching Cyrillic 
spoofs, and 1% have Greek spoofs -- however, the second pass should 
completely eliminate any problems caused by the transitivity in the 
first pass.
We have to remove some of the cross-script pairs until the software is
changed, to fix the spurious within-script conflations. I'm not going to
make everyone suffer while we have our leisurely chat about possible
long-term fixes.
There is a need for judgement, regardless of the software in use. Trolls
will go on trolling regardless of what anti-spoofing restrictions we have in
place. Our aim should be to minimise their impact, and heuristic systems
with a high false positive rate do quite the opposite.
-- Tim Starling
