Steve Summit wrote:
Tim Starling wrote:
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
The problem is that merging sets is fairly fundamental to the way AntiSpoof works....
Clearly a more flexible/sophisticated approach, rather than calling all these characters "equivalent", would be to assign a quantitative visual distance to each pair, and when traversing a chain such as n -> eta -> Eta -> H, to sum those distances (or something like that) rather than treating the equivalences as a purely transitive relation.
But obviously that's much more expensive than computing, storing, and indexing a single canonical representation for each string.
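To make the distance-summing idea concrete, here is a rough Python sketch (the pairwise distances are invented for illustration; they don't come from any Unicode data):

    ETA   = '\u03b7'   # greek small letter eta
    ETA_C = '\u0397'   # greek capital letter eta

    # Hypothetical per-pair visual distances, made up for this example.
    VISUAL_DISTANCE = {
        ('n', ETA):   1,  # latin n vs. greek eta: similar, not identical
        (ETA, ETA_C): 2,  # eta vs. Eta: a case pair, visually distinct
        (ETA_C, 'H'): 0,  # capital Eta vs. latin H: indistinguishable
    }

    def chain_distance(chain):
        """Sum the pairwise visual distances along a confusability chain."""
        return sum(VISUAL_DISTANCE[p] for p in zip(chain, chain[1:]))

    print(chain_distance(['n', ETA, ETA_C, 'H']))  # 3

With a confusability threshold of, say, 1, n and H would no longer be treated as equivalent, even though each adjacent pair in the chain is individually confusable.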
A hybrid approach I've contemplated (but not implemented, so I can't prove it works) is to use the canonical representations to generate expansive sets of candidate collisions, but then to do a more sophisticated (perhaps distance-based) comparison of just those candidates, to weed out the false positives.
I have already discussed something exactly like the above in E-mail.
As you have suggested above, the idea was to use the big dumb equivalence set table as a first hack to spot possible spoof candidates, and then to apply more sophisticated processing, using among other things the UTR #39 confusables.txt tables, to up to N of the spoof candidates, falling back to the dumb algorithm if the number of candidates exceeds N, where N is perhaps 20. (This limit is needed to avoid denial-of-service attacks via the antispoof algorithm.)
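A minimal sketch of that two-pass scheme in Python (the function names and the refined test here are placeholders, not the existing AntiSpoof code):

    MAX_CANDIDATES = 20  # the N above

    def is_spoof(new_name, existing_by_key, canonical_key, refined_match):
        """Two-pass spoof check.

        canonical_key(name)  -- the big dumb equivalence-set canonicalization
        existing_by_key      -- dict: canonical key -> list of existing names
        refined_match(a, b)  -- slower pairwise test, e.g. built on the
                                UTR #39 confusables.txt data (placeholder)
        """
        candidates = existing_by_key.get(canonical_key(new_name), [])
        if not candidates:
            return False
        if len(candidates) > MAX_CANDIDATES:
            # Too many candidates to examine one by one: fall back to the
            # dumb algorithm's verdict, to avoid a denial of service.
            return True
        # Second pass: weed out the first pass's false positives.
        return any(refined_match(new_name, old) for old in candidates)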
Indeed, if this is implemented, the canonicalization function could be made even more of a catch-all, allowing it to catch even more nasties than the existing code does, since the second, more sophisticated pass would then be able to clean up the larger number of false positives that a more aggressive first pass would generate.
I'd be happy to code this up in Python, for translation into PHP.
Anyone interested in this issue should consult Unicode Technical Standard #39, "Unicode Security Mechanisms", at http://www.unicode.org/reports/tr39/. In particular, its discussion of "confusables" addresses essentially the same issue we're talking about here. See also the Unicode data file "confusables.txt".
I'm actively working on this label-spoofing problem for another project, so I'm well aware of UTR #39. As Tim has observed, the current equivalence sets are the transitive closure of the equivalence relations in UTR #39's confusables.txt file (plus some extra nasties), the Unicode uppercasing relationships, and the relationships created by discarding combining marks to uncover the base character. The script-mixing constraints are also taken directly from UTR #39.
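For what it's worth, that transitive closure can be reproduced with a simple union-find over all three sources of pairs. A sketch with toy inputs (the real code would read the full confusables.txt, whose entries actually map characters to sequences, and iterate over all of Unicode):

    import unicodedata

    parent = {}

    def find(c):
        # Union-find lookup with path compression.
        parent.setdefault(c, c)
        if parent[c] != c:
            parent[c] = find(parent[c])
        return parent[c]

    def union(a, b):
        parent[find(a)] = find(b)

    # Toy inputs for illustration only.
    confusable_pairs = [('n', '\u03b7'), ('\u0397', 'H')]  # n~eta, Eta~H
    chars = ['n', 'H', '\u03b7', '\u0397']

    # 1. Confusable pairs.
    for a, b in confusable_pairs:
        union(a, b)

    # 2. Uppercasing relationships.
    for ch in chars:
        union(ch, ch.upper())

    # 3. Discard combining marks to uncover the base character.
    for ch in chars:
        base = ''.join(c for c in unicodedata.normalize('NFD', ch)
                       if not unicodedata.combining(c))
        if len(base) == 1:
            union(ch, base)

    print(find('n') == find('H'))  # True: the n ~ H over-merge Tim describes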
I've also got some suggestions for tightening up the existing MediaWiki integration, by dealing with a couple of edge cases that are currently handled less than optimally.
-- Neil