On Sat, Jun 20, 2009 at 9:46 PM, Neil Harris <usenet@tonal.clara.co.uk> wrote:
Regarding dashes and hyphens, I've now found my original data set, and a quick inspection gives this set of various similar-looking Latin hyphens, dashes and minus signs:

U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH

and at this point I missed out U+2014 EM DASH, which was hiding in the world of transitive closure mentioned below...

U+2212 MINUS SIGN
U+FE58 SMALL EM DASH
U+FF0D FULLWIDTH HYPHEN-MINUS
I think you have to be mindful of the original goal here: for each character a user is likely to enter from their keyboard in the search box, what possible range of characters would they expect to match?
So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably. The other way around... probably not, unless U+2019 actually exists on some keyboards.
Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you search for "clock-work", you probably don't want to match a sentence like "He was building a clock—work that is never easy—at the time." (contrived, sure)
I'm just saying you probably don't want the full range of "lookalikes": the left side of each mapping should be a keyboard character, and the right side should be a character that is semantically equivalent to it, or one that is commonly (if incorrectly) used in its place.
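
To make the idea concrete, here is a rough Python sketch of a one-directional mapping applied at search time. The table name SEARCH_EQUIVALENTS, the function query_to_pattern, and the particular entries in the table are all made up for illustration; which characters belong on the right-hand side is exactly the thing that would need to be decided, so treat the entries as placeholders rather than a proposal.

```python
import re

# Hypothetical table: keyboard character -> extra characters that a search
# for it should also match. The entries are illustrative assumptions only.
SEARCH_EQUIVALENTS = {
    "'": ["\u2019"],            # apostrophe -> right single quotation mark
    "-": ["\u2010", "\u2011"],  # hyphen-minus -> hyphen, non-breaking hyphen
}


def query_to_pattern(query):
    """Build a regex that matches the query plus its mapped equivalents.

    The mapping is applied only from the typed (left-side) character to
    its right-side equivalents, never in the reverse direction.
    """
    parts = []
    for ch in query:
        alternatives = [ch] + SEARCH_EQUIVALENTS.get(ch, [])
        if len(alternatives) == 1:
            parts.append(re.escape(ch))
        else:
            # A character class that accepts any one of the alternatives.
            escaped = "".join(re.escape(a) for a in alternatives)
            parts.append("[" + escaped + "]")
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = query_to_pattern("clock-work")
    # U+2010 HYPHEN is in the table, so this should match.
    print(bool(pattern.search("a clock\u2010work toy")))
    # U+2014 EM DASH is deliberately left out, so this should not.
    print(bool(pattern.search("building a clock\u2014work that is never easy")))
```

Because the table is keyed only on what the user typed, a query containing U+2019 would not match text containing a plain apostrophe, which is the asymmetry described above.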
Steve