Re: [Wikitech-l] AntiSpoof issues

12 Nov 2006


Tim Starling wrote:
...
If we merge all these pairs into a set, following the relations, we obtain
the result that latin n is the same as latin H. This is incorrect, and is
the cause of most of the bizarre false positives that we see with AntiSpoof.
The problem is that merging sets is fairly fundamental to the way AntiSpoof
works....

Clearly a more flexible/sophisticated approach, rather than
calling all these characters "equivalent", would be to assign
some quantitative visual difference between them, and when
traversing a chain such as n -> eta -> Eta -> H, to sum the
numbers (or something) rather than considering the equivalences
to be a purely transitive relationship.
But obviously that's much more expensive than computing, storing,
and indexing a single canonical representation for each string.
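
To make the contrast concrete, here is a rough Python sketch of both ideas for the chain n -> eta -> Eta -> H. The pairwise distances, the threshold, and the function names are all invented for illustration; nothing here is taken from AntiSpoof itself.

```python
# Hypothetical pairwise visual distances (0 = indistinguishable).
# The numbers are made up purely for illustration.
PAIR_DISTANCE = {
    ("n", "\u03b7"): 0.2,       # Latin n vs Greek small eta
    ("\u03b7", "\u0397"): 0.5,  # Greek small eta vs Greek capital Eta
    ("\u0397", "H"): 0.0,       # Greek capital Eta vs Latin H
}


def merge_transitively(pairs):
    """Merge pairs into equivalence classes (union-find).

    Returns a function mapping a character to its class representative.
    """
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in pairs:
        parent[find(a)] = find(b)
    return find


def chain_distance(chain):
    """Sum the pairwise distances along a chain of characters."""
    total = 0.0
    for a, b in zip(chain, chain[1:]):
        # Unknown pairs get an assumed maximal distance of 1.0.
        total += PAIR_DISTANCE.get((a, b), PAIR_DISTANCE.get((b, a), 1.0))
    return total


THRESHOLD = 0.5  # assumed cut-off for "visually confusable"

find = merge_transitively(PAIR_DISTANCE)
same_class = find("n") == find("H")   # purely transitive merging

total = chain_distance(["n", "\u03b7", "\u0397", "H"])
confusable = total <= THRESHOLD       # summed-distance view
```

With these made-up numbers, the transitive merge puts n and H in the same class, while the summed distance along the chain exceeds the threshold, so the pair would not be flagged.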
A hybrid approach I've contemplated (but not implemented, so I
can't prove it works) is to use the canonical representations to
generate expansive sets of candidate collisions, but then to do
a more sophisticated (perhaps distance-based) comparison of just
those candidates, to weed out the false positives.
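
As a rough sketch of that hybrid (again untested against real data, and with a hypothetical canonical map, distance table, and threshold), the canonical form serves only as an index key, and a per-character distance is then applied to the candidates it returns:

```python
from collections import defaultdict

# Hypothetical coarse canonical map, as produced by transitive merging.
CANONICAL = {"n": "H", "\u03b7": "H", "\u0397": "H", "H": "H"}

# Hypothetical direct (non-transitive) visual distances between characters.
DIRECT = {
    frozenset(("H", "\u0397")): 0.0,       # Latin H vs Greek capital Eta
    frozenset(("n", "\u03b7")): 0.2,       # Latin n vs Greek small eta
    frozenset(("\u03b7", "\u0397")): 0.5,  # Greek eta, small vs capital
}


def canonical(name):
    """Cheap, indexable canonical representation of a name."""
    return "".join(CANONICAL.get(c, c) for c in name)


def char_distance(a, b):
    if a == b:
        return 0.0
    # Pairs with no direct entry are assumed to be clearly distinct.
    return DIRECT.get(frozenset((a, b)), 1.0)


def name_distance(a, b):
    # Simplification: only compares names of equal length, char by char.
    if len(a) != len(b):
        return float("inf")
    return sum(char_distance(x, y) for x, y in zip(a, b))


def build_index(names):
    """Group existing names by canonical form."""
    index = defaultdict(list)
    for name in names:
        index[canonical(name)].append(name)
    return index


def find_collisions(new_name, index, threshold=0.3):
    """Stage 1: canonical lookup. Stage 2: distance filter."""
    candidates = index.get(canonical(new_name), [])
    return [c for c in candidates if name_distance(new_name, c) <= threshold]


index = build_index(["Hat", "nat"])
collisions = find_collisions("\u0397at", index)  # Greek capital Eta + "at"
```

Here both "Hat" and "nat" come back as candidates because they share a canonical form, but only "Hat" survives the distance filter; "nat" is dropped as a false positive.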
Anyone interested in this issue should consult Unicode Technical
Standard #39, "Unicode Security Mechanisms", at
http://www.unicode.org/reports/tr39/. In particular, its discussion
of "confusables" is basically the same issue we're talking about
here. See also the Unicode data file "confusables.txt".
