On 11/12/06, Tim Starling tstarling@wikimedia.org wrote: [snip]
The problem is that merging sets is fairly fundamental to the way AntiSpoof works -- i.e. by calculating a canonical representation of the username, storing it and indexing it.
[snip]
Two pass:
Use the current high-compression function to locate candidate matches quickly from a non-unique index.
Then take the actual candidate names and compare them directly with a more intelligent comparison (e.g. 'n' != 'H').
The compression function could even be made more lossy, so that it identifies a large but not unreasonable number of candidates.
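
A rough sketch of the two passes in Python, just to make the idea concrete. Everything here is made up for illustration: the FOLD and LOOKALIKE_PAIRS tables, the NameIndex class and the function names are hypothetical, and none of this is AntiSpoof's actual code or its real equivalence tables.

```python
from collections import defaultdict

# Hypothetical pass-one folding table: deliberately lossy, so that
# unrelated-looking characters such as 'h' and 'n' end up merged.
FOLD = {"0": "o", "1": "l", "i": "l", "h": "n"}

# Hypothetical pass-two table: case-sensitive pairs that really look alike.
LOOKALIKE_PAIRS = {
    frozenset(p) for p in [("0", "O"), ("0", "o"), ("1", "l"), ("I", "l")]
}


def lossy_key(name):
    """Pass one: lowercase, then fold every character through FOLD."""
    return "".join(FOLD.get(ch, ch) for ch in name.lower())


def strict_confusable(a, b):
    """Pass two: compare the real names character by character."""
    if len(a) != len(b):
        return False
    return all(
        x == y or frozenset((x, y)) in LOOKALIKE_PAIRS for x, y in zip(a, b)
    )


class NameIndex:
    """Non-unique index from lossy key to the names that produced it."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, name):
        self._by_key[lossy_key(name)].append(name)

    def candidates(self, name):
        # Cheap lookup: everything sharing the lossy key.
        return list(self._by_key.get(lossy_key(name), []))

    def conflicts(self, name, is_confusable=strict_confusable):
        # Expensive check, run only on the handful of candidates.
        return [c for c in self.candidates(name) if is_confusable(name, c)]


index = NameIndex()
index.add("Paul")
index.add("nat")
print(index.candidates("Hat"))   # shares a lossy key with "nat"
print(index.conflicts("Hat"))    # but 'H' vs 'n' fails the strict pass
print(index.conflicts("Pau1"))   # '1' vs 'l' passes both
```

The point is that the index no longer has to be exact: it only needs to never miss a real match, and the second pass cleans up the false positives.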
We could even assign points to various kinds of matches and deny past a threshold. This would also make it easier to support bi/trigram triggers such as cI ~= d, which perhaps get more interesting when we consider the entire UTF-8 charset.
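
One possible way to do the points idea, sketched in Python as a weighted edit distance. Note that this sketch counts points for differences rather than for matches, so a low total means "too similar" and the name is denied at or below the threshold. The tables, weights, threshold and function names are all invented for illustration.

```python
# Hypothetical tables and weights, chosen only to illustrate the idea.
SINGLE = {frozenset(p) for p in [("1", "l"), ("0", "O"), ("I", "l")]}
MULTI = [("cI", "d"), ("cl", "d"), ("rn", "m"), ("vv", "w")]
COST_SINGLE, COST_MULTI, COST_OTHER = 1, 2, 5


def difference_points(a, b):
    """Weighted edit distance: look-alikes are cheap, anything else is not."""
    inf = float("inf")
    n, m = len(a), len(b)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = dp[i][j]
            if cur == inf:
                continue
            if i < n and j < m:
                # One character against one character.
                if a[i] == b[j]:
                    step = 0
                elif frozenset((a[i], b[j])) in SINGLE:
                    step = COST_SINGLE
                else:
                    step = COST_OTHER
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], cur + step)
            if i < n:  # character present only in a
                dp[i + 1][j] = min(dp[i + 1][j], cur + COST_OTHER)
            if j < m:  # character present only in b
                dp[i][j + 1] = min(dp[i][j + 1], cur + COST_OTHER)
            # Bigram triggers such as cI ~= d, tried in both directions.
            for wide, narrow in MULTI:
                w, s = len(wide), len(narrow)
                if a.startswith(wide, i) and b.startswith(narrow, j):
                    dp[i + w][j + s] = min(dp[i + w][j + s], cur + COST_MULTI)
                if a.startswith(narrow, i) and b.startswith(wide, j):
                    dp[i + s][j + w] = min(dp[i + s][j + w], cur + COST_MULTI)
    return dp[n][m]


def should_deny(new_name, existing, threshold=2):
    """Deny when the names differ by no more than `threshold` points."""
    return difference_points(new_name, existing) <= threshold


print(should_deny("cIive", "dive"))  # one bigram trigger apart
print(should_deny("nat", "Hat"))     # 'n' vs 'H' is a real difference
```

Since this only runs against the candidates returned by the index lookup, the quadratic cost per pair should not matter much.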