On Sat, Jun 20, 2009 at 9:46 PM, Neil Harris <usenet(a)tonal.clara.co.uk> wrote:
> Regarding dashes and hyphens, I've now found my original data set, and
> a quick inspection gives this set of various similar-looking Latin
> hyphens, dashes and minus signs:
> U+002D HYPHEN-MINUS
> U+2010 HYPHEN
> U+2011 NON-BREAKING HYPHEN
> U+2012 FIGURE DASH
> U+2013 EN DASH
> and at this point I missed out U+2014 EM DASH, which was hiding in the
> world of transitive closure mentioned below...
> U+2212 MINUS SIGN
> U+FE58 SMALL EM DASH
> U+FF0D FULLWIDTH HYPHEN-MINUS
I think you have to be mindful of the original goal here: for each
character a user is likely to enter from their keyboard in the search
box, what possible range of characters would they expect to match?
So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably.
The other way around... probably not, unless U+2019 actually exists on some keyboards.
Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you
search for "clock-work", you probably don't want to match a sentence
like "He was building a clock—work that is never easy—at the time."
(contrived, sure)
Just saying you probably don't want the full range of "lookalikes" -
the left side of each mapping should be a keyboard character, and the
right side should be either semantically equivalent to it or commonly
(if incorrectly) used in its place.
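
To make that concrete, here is a rough Python sketch of a one-directional
mapping. The names SEARCH_EQUIVALENTS and query_to_pattern are made up for
illustration, and the choice of U+2010 and U+2011 as acceptable matches for
hyphen-minus is only my assumption, not an agreed list. The em dash is
deliberately left out.

```python
import re

# Hypothetical table: keyboard character -> extra characters it may match.
# Only keyboard characters appear as keys, so the mapping is one-directional.
SEARCH_EQUIVALENTS = {
    "'": "\u2019",        # apostrophe also matches RIGHT SINGLE QUOTATION MARK
    "-": "\u2010\u2011",  # hyphen-minus also matches HYPHEN, NON-BREAKING HYPHEN
}

def query_to_pattern(query):
    """Build a regex in which each typed character also matches its equivalents."""
    parts = []
    for ch in query:
        alternatives = ch + SEARCH_EQUIVALENTS.get(ch, "")
        if len(alternatives) > 1:
            # Character class containing the typed character and its equivalents.
            parts.append("[" + "".join(re.escape(c) for c in alternatives) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

pattern = query_to_pattern("clock-work")
print(bool(pattern.search("a clock\u2010work toy")))            # U+2010 HYPHEN
print(bool(pattern.search("building a clock\u2014work that")))  # U+2014 EM DASH
```

Because U+2019 is not a key in the table, a query containing it matches only
itself, which is the "other way around... probably not" case above.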
Steve