Tim Starling wrote:
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
The problem is that merging sets is fairly fundamental to the way AntiSpoof works....
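To make the quoted problem concrete, here is a minimal sketch of what purely transitive merging does to the chain discussed below. The pair list is a hypothetical illustration, not AntiSpoof's actual equivalence table, and the union-find is just one common way to merge such pairs.

```python
# Hypothetical pairs for illustration only; not AntiSpoof's real data.
PAIRS = [
    ("n", "η"),  # Latin n resembles Greek small eta
    ("η", "Η"),  # Greek small eta is the lowercase of Greek capital Eta
    ("Η", "H"),  # Greek capital Eta resembles Latin H
]


def build_classes(pairs):
    """Merge pairs transitively and map each character to a class representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return {ch: find(ch) for ch in list(parent)}


classes = build_classes(PAIRS)
# Each individual pair is defensible, but the merged class equates n with H.
n_equals_h = classes["n"] == classes["H"]
```

Every link in the chain is reasonable on its own; the false positive only appears once the links are collapsed into a single class.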
Clearly a more flexible/sophisticated approach, rather than calling all these characters "equivalent", would be to assign some quantitative visual difference between them, and when traversing a chain such as n -> eta -> Eta -> H, to sum the numbers (or something) rather than considering the equivalences to be a purely transitive relationship.
But obviously that's much more expensive than computing, storing, and indexing a single canonical representation for each string.
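As a rough sketch of the distance idea, assuming made-up per-pair costs and an arbitrary cutoff (both hypothetical, and `visual_distance` and `confusable` are names invented here), the chain can be scored by its cheapest path instead of being treated as an equivalence:

```python
import heapq

# Hypothetical visual-difference costs: 0 means indistinguishable,
# larger means less alike. These numbers are assumptions, not measured data.
EDGES = {
    ("n", "η"): 0.2,
    ("η", "Η"): 1.0,  # a case pair: related, but not visually alike
    ("Η", "H"): 0.0,
}
THRESHOLD = 0.5  # assumed cutoff for "confusable"


def build_graph(edges):
    """Turn the pair costs into an undirected adjacency list."""
    graph = {}
    for (a, b), w in edges.items():
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    return graph


def visual_distance(graph, src, dst):
    """Cheapest summed cost from src to dst (Dijkstra); inf if unconnected."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, ch = heapq.heappop(heap)
        if ch == dst:
            return d
        if d > best.get(ch, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph.get(ch, []):
            nd = d + w
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")


def confusable(graph, a, b, threshold=THRESHOLD):
    return visual_distance(graph, a, b) <= threshold


GRAPH = build_graph(EDGES)
```

With these assumed costs, Η and H still count as confusable, while n and H do not, because the cost accumulates along n -> eta -> Eta -> H. The price is a graph search per character pair instead of a single table lookup, which is the expense referred to above.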
A hybrid approach I've contemplated (but not implemented, so I can't prove it works) is to use the canonical representations to generate expansive sets of candidate collisions, but then to do a more sophisticated (perhaps distance-based) comparison of just those candidates, to weed out the false positives.
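Something like the following is what I have in mind, offered as an untested-in-production sketch. The canonical map, the pair costs, the threshold, and the `NameIndex` class are all hypothetical, and comparing names position by position is a simplification that ignores multi-character confusables.

```python
from collections import defaultdict

# Deliberately over-merged canonical map (hypothetical): cheap to index,
# but it equates n with H, as in the chain above.
CANON = {"n": "h", "η": "h", "Η": "h", "H": "h"}

# Hypothetical direct-pair costs; any pair not listed is assumed distinct (1.0).
COST = {
    frozenset(("n", "η")): 0.2,
    frozenset(("Η", "H")): 0.0,
}


def skeleton(name):
    """Canonical representation used only as an index key."""
    return "".join(CANON.get(c, c.lower()) for c in name)


def char_cost(a, b):
    if a == b:
        return 0.0
    return COST.get(frozenset((a, b)), 1.0)


def name_distance(a, b):
    """Sum of per-position costs; names of different length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(char_cost(x, y) for x, y in zip(a, b))


class NameIndex:
    def __init__(self, threshold=0.5):
        self.by_skeleton = defaultdict(list)
        self.threshold = threshold

    def add(self, name):
        self.by_skeleton[skeleton(name)].append(name)

    def candidates(self, name):
        """Stage 1: broad candidate set from the cheap canonical lookup."""
        return list(self.by_skeleton.get(skeleton(name), []))

    def conflicts(self, name):
        """Stage 2: keep only candidates within the distance threshold."""
        return [c for c in self.candidates(name)
                if name_distance(name, c) <= self.threshold]


index = NameIndex()
index.add("Hat")
index.add("ηat")
found = index.candidates("nat")   # both existing names share the skeleton
clashes = index.conflicts("nat")  # only the visually close one survives
```

The expensive comparison runs only on names that already collide on the skeleton, so the index stays a single canonical string per name.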
Anyone interested in this issue should consult Unicode Technical Standard #39, "Unicode Security Mechanisms", at http://www.unicode.org/reports/tr39/. In particular, its discussion of "confusables" is basically the same issue we're talking about here. See also the Unicode data file "confusables.txt".