We've been having quite a few complaints about false positives from the AntiSpoof extension -- an extension which attempts to prevent registration of names which are confusingly similar to names already registered. Brion responded to these complaints with "get a sysop to make the account for you", but I don't think that's a very good solution. So I've been working on the AntiSpoof extension today, attempting to make it a bit more relaxed.
The most fundamental problem is that of merging sets. Say we want to treat visually similar characters as members of one set, and we also want to treat letters which differ only in case as members of one set. So, for example, suppose we have the following pairs:
Η (capital eta) = H (latin)
Η (capital eta) = η (lowercase eta)
η (lowercase eta) = n (latin)
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
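To make the effect concrete, here is a minimal sketch of merging those three pairs with a simple union-find. This is not AntiSpoof's actual code; the function name merge_pairs and the data layout are made up for illustration.

```python
def merge_pairs(pairs):
    """Merge equivalence pairs into sets; return a function giving each
    character's set representative (hypothetical helper, not AntiSpoof's)."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # union the two sets
    return find


pairs = [
    ("\u0397", "H"),       # capital eta = latin H (visual similarity)
    ("\u0397", "\u03b7"),  # capital eta = lowercase eta (case folding)
    ("\u03b7", "n"),       # lowercase eta = latin n (visual similarity)
]
find = merge_pairs(pairs)

# Following the relations transitively puts latin n and latin H in one set.
n_equals_h = find("n") == find("H")
```

Each pair is reasonable on its own; it is only the transitive merge that produces the n = H conflation.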
The problem is that merging sets is fairly fundamental to the way AntiSpoof works -- i.e. by calculating a canonical representation of the username, storing it and indexing it. So it's not going to change any time soon unless we get really clever. But there are some things we can do to minimise its effects.
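As a rough sketch of why the design depends on merged sets: if every character maps to one representative, a conflict check is a single lookup on the canonical form. The table, the function names and the dict standing in for an indexed database column are all assumptions for illustration, not the extension's real schema.

```python
# Hypothetical, tiny equivalence table: each character maps to the
# representative of its merged set. A real table would be far larger.
CANONICAL = {
    "\u0397": "H",  # capital eta
    "\u03b7": "H",  # lowercase eta, merged via capital eta
    "n": "H",       # latin n, merged via lowercase eta
    "h": "H",       # case folding
}


def canonicalize(name):
    """Replace each character by its set representative."""
    return "".join(CANONICAL.get(ch, ch) for ch in name)


# canonical form -> existing username; stands in for an indexed column
registered = {}


def try_register(name):
    """Return the conflicting existing name, or None if registration succeeds."""
    key = canonicalize(name)
    if key in registered:
        return registered[key]
    registered[key] = name
    return None


first = try_register("Hat")   # no conflict yet
second = try_register("nat")  # canonicalizes to the same key as "Hat"
```

The lookup is cheap because each name has exactly one canonical form. Dropping the merge would mean comparing a new name against many possible forms, which is where the performance worry comes from.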
The first and most obvious thing to do was to remove the transliteration pairs. These are pairs of characters where one member of the pair is a common phonetic transliteration of the other, e.g. cyrillic en "Н" = latin N. These pairs were the cause of most of the spurious conflations between latin characters. They should now be gone.
There are now three remaining categories of conflated character pairs: case folding, visual similarity and chinese traditional/simplified conversion.
The second thing to do is to minimise cross-script pairs. Since cross-script usernames are disallowed, cross-script pairs are mostly redundant. You could make a case for leaving some of them in: for example, some latin usernames can be spoofed entirely using cyrillic characters, and some communities may have a special need to allow a certain pair of scripts in a username (e.g. latin and hiragana). It's best if we keep only the pairs which are visually very similar, and consciously avoid including cross-script pairs which will cause false conflations within a script.
I've done some work on this, but I think it's time to hand over the job to the community, if the community wants it. I've created a page with a big list of pairs, at:
http://www.mediawiki.org/wiki/AntiSpoof/Equivalence_sets
You can edit this page. I will update the live copy on request.
Really clever ideas on how to avoid merging sets while maintaining good performance would be appreciated.
Another misfeature in AntiSpoof which was causing false positives was the fact that it merged sequences of repeated characters. For example, Yuma was considered to be equal to Uma, because Y=U (from a transliteration pair), and UUma = Uma. I've removed this behaviour.
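The Yuma/Uma case can be reproduced with a small sketch. The pair table, the use of upper-casing as a stand-in for case folding, and the collapse_repeats flag are assumptions for illustration only, not the extension's code.

```python
from itertools import groupby

# Hypothetical single transliteration pair, as in the example above.
PAIRS = {"Y": "U"}


def canonical(name, collapse_repeats):
    """Map characters through PAIRS, optionally merging runs of repeats."""
    mapped = "".join(PAIRS.get(ch, ch) for ch in name.upper())
    if collapse_repeats:
        # Keep one character from each run of identical characters.
        mapped = "".join(ch for ch, _ in groupby(mapped))
    return mapped


old_yuma = canonical("Yuma", collapse_repeats=True)   # YUMA -> UUMA -> UMA
old_uma = canonical("Uma", collapse_repeats=True)     # UMA
new_yuma = canonical("Yuma", collapse_repeats=False)  # YUMA -> UUMA
new_uma = canonical("Uma", collapse_repeats=False)    # UMA
```

With the collapsing step the two names collide; without it they stay distinct even though the Y=U pair is still in the table.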
I should really get a blog...
-- Tim Starling