Steve Summit wrote:
Tim Starling wrote:
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
The problem is that merging sets is fairly fundamental to the way AntiSpoof works....
Clearly a more flexible/sophisticated approach, rather than calling all these characters "equivalent", would be to assign a quantitative visual distance to each pair, and when traversing a chain such as n -> eta -> Eta -> H, to sum those distances (or something like that) rather than treating the equivalences as a purely transitive relation.
But obviously that's much more expensive than computing, storing, and indexing a single canonical representation for each string.
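To make the distance-summing idea concrete, here is a rough Python sketch (the pairwise distances are invented for illustration; they don't come from any Unicode data):

    ETA   = '\u03b7'   # greek small letter eta
    ETA_C = '\u0397'   # greek capital letter eta

    # Hypothetical per-pair visual distances, made up for this example.
    VISUAL_DISTANCE = {
        ('n', ETA):   1,  # latin n vs. greek eta: similar, not identical
        (ETA, ETA_C): 2,  # eta vs. Eta: a case pair, visually distinct
        (ETA_C, 'H'): 0,  # capital Eta vs. latin H: indistinguishable
    }

    def chain_distance(chain):
        """Sum the pairwise visual distances along a confusability chain."""
        return sum(VISUAL_DISTANCE[p] for p in zip(chain, chain[1:]))

    print(chain_distance(['n', ETA, ETA_C, 'H']))  # 3

With a confusability threshold of, say, 1, n and H would no longer be treated as equivalent, even though each adjacent pair in the chain is individually confusable.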
A hybrid approach I've contemplated (but not implemented, so I can't prove it works) is to use the canonical representations to generate expansive sets of candidate collisions, but then to do a more sophisticated (perhaps distance-based) comparison of just those candidates, to weed out the false positives.
I have already discussed something exactly like the above in E-mail.
As you have suggested above, the idea was to use the big dumb equivalence set table as a first hack to spot possible spoof candidates, and then to apply more sophisticated processing, using among other things the UTR #39 confusables.txt tables, to up to N of the spoof candidates, falling back to the dumb algorithm if the number of candidates exceeds N, where N is perhaps 20. (This limit is needed to avoid denial-of-service attacks via the antispoof algorithm.)
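A minimal sketch of that two-pass scheme in Python (the function names and the refined test here are placeholders, not the existing AntiSpoof code):

    MAX_CANDIDATES = 20  # the N above

    def is_spoof(new_name, existing_by_key, canonical_key, refined_match):
        """Two-pass spoof check.

        canonical_key(name)  -- the big dumb equivalence-set canonicalization
        existing_by_key      -- dict: canonical key -> list of existing names
        refined_match(a, b)  -- slower pairwise test, e.g. built on the
                                UTR #39 confusables.txt data (placeholder)
        """
        candidates = existing_by_key.get(canonical_key(new_name), [])
        if not candidates:
            return False
        if len(candidates) > MAX_CANDIDATES:
            # Too many candidates to examine one by one: fall back to the
            # dumb algorithm's verdict, to avoid a denial of service.
            return True
        # Second pass: weed out the first pass's false positives.
        return any(refined_match(new_name, old) for old in candidates)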
Indeed, if this is implemented, the canonicalization function could be made even more of a catch-all, allowing it to catch even more nasties than the existing code does, since the second, more sophisticated pass would then be able to clean up the larger number of false positives that a more aggressive first pass would generate.
I'd be happy to code this up in Python, for translation into PHP.
Anyone interested in this issue should consult Unicode Technical Standard #39, "Unicode Security Mechanisms", at http://www.unicode.org/reports/tr39/. In particular, its discussion of "confusables" addresses essentially the same issue we're talking about here. See also the Unicode data file "confusables.txt".
I'm actively working on this label-spoofing problem for another project, so I'm well aware of UTR #39. As Tim has observed, the current equivalence sets are the transitive closure of the equivalence relations in UTR #39's confusables.txt file (plus some extra nasties), the Unicode uppercasing relationships, and the relationships created by discarding combining marks to uncover the base character. The script-mixing constraints are also taken directly from UTR #39.
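For what it's worth, that transitive closure can be reproduced with a simple union-find over all three sources of pairs. A sketch with toy inputs (the real code would read the full confusables.txt, whose entries actually map characters to sequences, and iterate over all of Unicode):

    import unicodedata

    parent = {}

    def find(c):
        # Union-find lookup with path compression.
        parent.setdefault(c, c)
        if parent[c] != c:
            parent[c] = find(parent[c])
        return parent[c]

    def union(a, b):
        parent[find(a)] = find(b)

    # Toy inputs for illustration only.
    confusable_pairs = [('n', '\u03b7'), ('\u0397', 'H')]  # n~eta, Eta~H
    chars = ['n', 'H', '\u03b7', '\u0397']

    # 1. Confusable pairs.
    for a, b in confusable_pairs:
        union(a, b)

    # 2. Uppercasing relationships.
    for ch in chars:
        union(ch, ch.upper())

    # 3. Discard combining marks to uncover the base character.
    for ch in chars:
        base = ''.join(c for c in unicodedata.normalize('NFD', ch)
                       if not unicodedata.combining(c))
        if len(base) == 1:
            union(ch, base)

    print(find('n') == find('H'))  # True: the n ~ H over-merge Tim describes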
I've also got some suggestions for tightening up the existing MediaWiki integration, by dealing with a couple of edge cases that are currently handled less than optimally.
-- Neil