[Wikitech-l] AntiSpoof issues

12 Nov 2006


      We've been having quite a few complaints about false positives from the
AntiSpoof extension -- an extension which attempts to prevent registration
of names which are confusingly similar to names already registered. Brion
responded to these complaints with "get a sysop to make the account for
you", but I don't think that's a very good solution. So I've been working on
the AntiSpoof extension today, attempting to make it a bit more relaxed.
The most fundamental problem is the problem of merging sets. Say we want
to treat visually similar characters as part of a set, and we also want to
treat letters which differ only in case as part of a set. So,
for example, say we have the following pairs:
Η (capital eta) = H (latin)
Η (capital eta) = η (lowercase eta)
η (lowercase eta) = n (latin)
If we merge all these pairs into a set, following the relations, we obtain
the result that latin n is the same as latin H. This is incorrect, and is
the cause of most of the bizarre false positives that we see with AntiSpoof.
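
To make the conflation concrete, here is a minimal Python sketch of merging the three pairs above into sets with a union-find structure. The names merge_pairs and PAIRS are made up for illustration and are not taken from the AntiSpoof code; the point is only that following the relations transitively puts latin n and latin H in the same set.

```python
def merge_pairs(pairs):
    """Merge equivalence pairs into sets; return char -> representative."""
    parent = {}

    def find(x):
        # Unknown characters start out as their own representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # union the two sets

    return {ch: find(ch) for ch in list(parent)}


# The three pairs from the example above.
PAIRS = [
    ("\u0397", "H"),       # capital eta = latin H
    ("\u0397", "\u03b7"),  # capital eta = lowercase eta
    ("\u03b7", "n"),       # lowercase eta = latin n
]

canon = merge_pairs(PAIRS)
conflated = canon["n"] == canon["H"]  # True: n and H end up in one set
```

No single pair says n = H, but the merged set does.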
The problem is that merging sets is fairly fundamental to the way AntiSpoof
works -- i.e. by calculating a canonical representation of the username,
storing it and indexing it. So it's not going to change any time soon unless
we get really clever. But there are some things we can do to minimise its
effects.
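
A rough sketch of why the merged sets are hard to give up, in Python. Everything here (EQUIV, canonical, Registry) is a hypothetical illustration, not the extension's actual code: each name is reduced to one canonical string, and that string is the lookup key, so a conflict check is a single indexed lookup rather than a comparison against every existing name.

```python
# Hypothetical table: every character maps to one representative of its set.
EQUIV = {
    "\u0397": "H",  # capital eta
    "\u03b7": "H",  # lowercase eta
    "n": "H",       # latin n, pulled in by the merged set
}


def canonical(name):
    """Replace each character with the representative of its set."""
    return "".join(EQUIV.get(ch, ch) for ch in name)


class Registry:
    """Toy stand-in for a table indexed by canonical form."""

    def __init__(self):
        self.index = {}  # canonical form -> first name registered with it

    def try_register(self, name):
        """Return the conflicting existing name, or None if registered."""
        key = canonical(name)
        if key in self.index:
            return self.index[key]
        self.index[key] = name
        return None


registry = Registry()
first = registry.try_register("Hat")   # None, the name is free
second = registry.try_register("nat")  # "Hat", a false positive
```

The lookup is cheap precisely because each character has exactly one representative, which is also what forces the sets to be merged.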
The first and most obvious thing to do was to remove the transliteration
pairs. These are pairs of characters where one member of the pair is a
common phonetic transliteration of the other, e.g. cyrillic en "Н" = latin
N. This was the cause of most of the spurious conflations between latin
characters. This should now be done.
There are now three remaining categories of conflated character pairs: case
folding, visual similarity and chinese traditional/simplified conversion.
The second thing to do is to minimise cross-script pairs. Since cross-script
usernames are disallowed, cross-script pairs are mostly redundant. You could
make a case to leave some of them in, for example some latin usernames can
be spoofed entirely using cyrillic characters. And some communities may have
a special need for allowing a certain pair of scripts in a username (e.g.
latin and hiragana). It's best if we can just keep the pairs which are
visually very similar, and consciously avoid including cross-script pairs
which will cause false conflations within scripts.
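
As a starting point for reviewing the list, cross-script pairs can be separated from same-script ones mechanically. The Python sketch below guesses a character's script from the first word of its Unicode character name, which is a crude heuristic and an assumption of this example, not how AntiSpoof determines scripts; script_of and split_pairs are made-up names. Note that unicodedata.name raises ValueError for characters without a name.

```python
import unicodedata


def script_of(ch):
    """Crude script guess: first word of the Unicode name, e.g. 'GREEK'."""
    return unicodedata.name(ch).split(" ")[0]


def split_pairs(pairs):
    """Split pairs into (same-script, cross-script) lists."""
    same, cross = [], []
    for a, b in pairs:
        if script_of(a) == script_of(b):
            same.append((a, b))
        else:
            cross.append((a, b))
    return same, cross


PAIRS = [
    ("\u0397", "H"),       # capital eta / latin H: cross-script
    ("\u0397", "\u03b7"),  # capital eta / lowercase eta: same script
    ("\u03b7", "n"),       # lowercase eta / latin n: cross-script
]

same_script, cross_script = split_pairs(PAIRS)
```

Here the two cross-script pairs are exactly the ones that drag latin n and latin H together, so they are the ones to look at first.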
I've done some work on this, but I think it's time to hand over the job to
the community, if the community wants it. I've created a page with a big
list of pairs, at:
http://www.mediawiki.org/wiki/AntiSpoof/Equivalence_sets
You can edit this page. I will update the live copy on request.
Really clever ideas on how to avoid merging sets while maintaining good
performance would be appreciated.
Another misfeature in AntiSpoof which was causing false positives was the
fact that it merged sequences of repeated characters. For example, Yuma was
considered to be equal to Uma, because Y=U (from a transliteration pair),
and UUma = Uma. I've removed this behaviour.
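
The Yuma/Uma case can be reproduced with a small Python sketch. It assumes, for illustration only, that case folding is done by uppercasing and that the table contains the single transliteration pair Y=U; normalize and TRANSLIT are hypothetical names.

```python
from itertools import groupby

# Hypothetical one-entry table holding the transliteration pair Y = U.
TRANSLIT = {"Y": "U"}


def normalize(name, collapse_repeats):
    """Fold case, apply the pair table, optionally merge repeated runs."""
    chars = [TRANSLIT.get(ch, ch) for ch in name.upper()]
    if collapse_repeats:
        # Keep one character per run of identical characters.
        chars = [ch for ch, _run in groupby(chars)]
    return "".join(chars)


old_yuma = normalize("Yuma", collapse_repeats=True)   # "UMA"
old_uma = normalize("Uma", collapse_repeats=True)     # "UMA", collides
new_yuma = normalize("Yuma", collapse_repeats=False)  # "UUMA"
new_uma = normalize("Uma", collapse_repeats=False)    # "UMA", distinct
```

With the merging of repeats switched off, the two names no longer collide even if the Y=U pair were still present.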
I should really get a blog...
-- Tim Starling
