On 11/12/06, Steve Summit scs@eskimo.com wrote: [snip]
A hybrid approach I've contemplated (but not implemented, so I can't prove it works) is to use the canonical representations to generate expansive sets of candidate collisions, but then to do a more sophisticated (perhaps distance-based) comparison of just those candidates, to weed out the false positives.
[snip]
Woops. /me reminds self to read thread before replying.
Yes, this is an interesting idea. If anyone codes what's proposed, it would be useful to extend it to support multiple compression functions. For example, in addition to the similar-character metric, it would be useful to have a comparison based on double metaphone:
    dmetaphone('Sterling') == dmetaphone('Starling')      // Indexed lookup
    levenshtein('Tim Starling','Tim Sterling') == 1       // Second pass
(I have no clue whether PHP has handy standard library functions for dmetaphone and Levenshtein distance. I'm using the ones in PostgreSQL.)
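
To make the two-pass idea concrete, here is a rough Python sketch. It has not been tested against real title data. The names crude_key and find_collisions are hypothetical, and crude_key is only a simple stand-in for a real compression function such as double metaphone (it is not double metaphone). The first pass buckets titles by each key function, which is the part an index would handle. The second pass runs Levenshtein distance over just the candidates that share a bucket.

```python
from collections import defaultdict
from itertools import combinations


def crude_key(name):
    """Hypothetical stand-in for dmetaphone(): lowercase, drop non-letters,
    drop vowels after the first letter, collapse repeated letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    key = []
    for i, c in enumerate(letters):
        if i > 0 and c in "aeiou":
            continue
        if key and key[-1] == c:
            continue
        key.append(c)
    return "".join(key)


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_collisions(titles, key_funcs, max_distance=1):
    """First pass: bucket titles under each key function.
    Second pass: keep only candidate pairs within max_distance."""
    candidates = set()
    for key_func in key_funcs:
        buckets = defaultdict(list)
        for title in titles:
            buckets[key_func(title)].append(title)
        for group in buckets.values():
            candidates.update(combinations(sorted(set(group)), 2))
    return sorted(p for p in candidates if levenshtein(*p) <= max_distance)


titles = ["Tim Starling", "Tim Sterling", "Tom Strolling", "Steve Summit"]
collisions = find_collisions(titles, [crude_key, str.lower])
```

With this stand-in key, the first three titles land in the same bucket, and the second pass is what separates the near-identical pair from the looser match. Supporting another compression function is just a matter of adding it to the key_funcs list.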