Re: [Wikitech-l] Different apostrophe signs and MediaWiki internal search

20 Jun 2009


      Andrew Dunbar wrote:
...
2009/6/20 Neil Harris usenet@tonal.clara.co.uk:
...
Neil Harris wrote:
...
Andrew Dunbar wrote:
...
2009/6/20 Jaska Zedlik jz53zc@gmail.com:
...
Hello,
On Fri, Jun 19, 2009 at 20:31, Rolf Lampa rolf.lampa@rilnet.com wrote:
...
Jaska Zedlik skrev:
<...>
> The code of the override function is the following:
>
> function stripForSearch( $string ) {
>   $s = $string;
>   $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>   return parent::stripForSearch( $s );
> }
>
>
>               
I'm not a PHP programmer, but why use the extra assignment to $s
instead of passing $string directly to preg_replace, like so:
function stripForSearch( $string ) {
    $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
    return parent::stripForSearch( $s );
}
Really, you are right: in the real function all these redundant assignments
should be stripped for performance reasons. I just reused the skeleton
of the Japanese language class, which does some Japanese-specific
reduction, but I agree with your remark.
The username anti-spoofing code already knows about a lot of "looks similar"
characters which may be of some help.
Andrew Dunbar (hippietrail)
Of itself, the username anti-spoofing code table -- which I originally
wrote -- is rather too thorough for this purpose, since it deliberately
errs on the side of mapping even vaguely similar-looking characters to
one another, regardless of character type and script system, and this,
combined with case-folding and transitivity, leads to some apparently
bizarre mappings that are of no practical use for any other application.
If you're interested, I can take a look at producing a more limited
punctuation-only version.
-- Neil
http://www.unicode.org/reports/tr39/data/confusables.txt is probably the
single best source for information about visual confusables.
Staying entirely within the Latin punctuation repertoire, and avoiding
combining characters and other exotica such as math characters and
dingbats, you might want to consider the following characters as
possible unintentional lookalikes for the apostrophe:
U+0027 APOSTROPHE
U+2019 RIGHT SINGLE QUOTATION MARK
U+2018 LEFT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+2032 PRIME
U+00B4 ACUTE ACCENT
U+0060 GRAVE ACCENT
U+FF40 FULLWIDTH GRAVE ACCENT
U+FF07 FULLWIDTH APOSTROPHE
There are also lots of other characters that look like these from other
languages, and various combining character combinations which could also
look the same, but I doubt whether they would be generated in Latin text
by accident.
I would add
U+02BB MODIFIER LETTER TURNED COMMA (Hawaiian 'okina)
U+02C8 MODIFIER LETTER VERTICAL LINE (IPA primary stress mark)
It might be worthwhile folding some dashes and hyphens too.
Andrew Dunbar (hippietrail)
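For illustration, the apostrophe candidates listed so far could be folded with a single table lookup. This is only a sketch: foldApostrophes() is a hypothetical helper name, not an existing MediaWiki function, and whether every one of these characters should really be folded for a given language is a separate decision. The keys are the UTF-8 byte sequences of the code points named in the thread.

```php
<?php
// Hypothetical helper (not part of MediaWiki): fold apostrophe
// lookalikes to U+0027 APOSTROPHE. Assumes valid UTF-8 input.
function foldApostrophes( $string ) {
    static $map = array(
        "\xE2\x80\x99" => "'", // U+2019 RIGHT SINGLE QUOTATION MARK
        "\xE2\x80\x98" => "'", // U+2018 LEFT SINGLE QUOTATION MARK
        "\xE2\x80\x9B" => "'", // U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
        "\xE2\x80\xB2" => "'", // U+2032 PRIME
        "\xC2\xB4"     => "'", // U+00B4 ACUTE ACCENT
        "`"            => "'", // U+0060 GRAVE ACCENT
        "\xEF\xBD\x80" => "'", // U+FF40 FULLWIDTH GRAVE ACCENT
        "\xEF\xBC\x87" => "'", // U+FF07 FULLWIDTH APOSTROPHE
        "\xCA\xBB"     => "'", // U+02BB MODIFIER LETTER TURNED COMMA
        "\xCB\x88"     => "'", // U+02C8 MODIFIER LETTER VERTICAL LINE
    );
    // strtr() with an array replaces each key with its value in one pass.
    return strtr( $string, $map );
}
```

In the override discussed earlier, the call would presumably become parent::stripForSearch( foldApostrophes( $string ) ). A byte-wise strtr() is safe here because a complete UTF-8 sequence cannot match in the middle of another character.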
Interestingly, following up the above, I've found one source 
(http://snowball.tartarus.org/texts/apostrophe.html) that states that 
U+201B may be deliberately used as an apostrophe variant by some 
publishers in some contexts.
Regarding dashes and hyphens, I've now found my original data set, and a 
quick inspection gives this set of various similar-looking Latin 
hyphens, dashes and minus signs:
U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2212 MINUS SIGN
U+FE58 SMALL EM DASH
U+FF0D FULLWIDTH HYPHEN-MINUS
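As a rough sketch, the list above could be folded to U+002D HYPHEN-MINUS with one regular expression. foldDashes() is a hypothetical name used for illustration only, and the character class contains exactly the code points listed here and nothing more.

```php
<?php
// Hypothetical helper (not part of MediaWiki): fold the listed hyphens,
// dashes and minus signs to U+002D HYPHEN-MINUS.
function foldDashes( $string ) {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8, so
    // \x{...} matches whole code points. U+2010..U+2013 covers HYPHEN,
    // NON-BREAKING HYPHEN, FIGURE DASH and EN DASH. U+2014 EM DASH is
    // not in the list, so it is left alone.
    // Note: preg_replace() returns null if the subject is not valid UTF-8.
    return preg_replace(
        '/[\x{2010}-\x{2013}\x{2212}\x{FE58}\x{FF0D}]/u',
        '-',
        $string
    );
}
```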
I can send the full data set of lookalikes to anyone who is interested: 
it can be quite easily extended by regarding the relation "looks like" 
as transitive, to include more distant and linguistically dubious visual 
confusables such as (just for example) U+2015 HORIZONTAL BAR, U+1173 
HANGUL JUNGSEONG EU and U+2F00 KANGXI RADICAL ONE.
-- Neil
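To make the effect of transitivity concrete, here is a minimal union-find sketch that merges "looks like" pairs into equivalence classes. The function names and the sample pairs in the usage note are assumptions made up for illustration; they are not taken from the actual data set.

```php
<?php
// Find the representative of $x, registering it on first sight.
function findRoot( array &$parent, $x ) {
    if ( !isset( $parent[$x] ) ) {
        $parent[$x] = $x;
    }
    while ( $parent[$x] !== $x ) {
        $parent[$x] = $parent[$parent[$x]]; // path halving
        $x = $parent[$x];
    }
    return $x;
}

// Merge pairs such as array( 'U+002D', 'U+2013' ) into classes,
// treating "looks like" as symmetric and transitive.
// Labels must be non-numeric strings so array keys stay strings.
function confusableClasses( array $pairs ) {
    $parent = array();
    foreach ( $pairs as $pair ) {
        $a = findRoot( $parent, $pair[0] );
        $b = findRoot( $parent, $pair[1] );
        if ( $a !== $b ) {
            $parent[$b] = $a;
        }
    }
    $classes = array();
    foreach ( array_keys( $parent ) as $x ) {
        $classes[ findRoot( $parent, $x ) ][] = $x;
    }
    return array_values( $classes );
}
```

Given the made-up pairs (U+002D, U+2013), (U+2013, U+2015) and (U+2015, U+2F00), all four characters end up in one class even though U+002D and U+2F00 were never paired directly. That is the mechanism by which a transitive table grows to include the more dubious lookalikes.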
