Hello,
On Fri, Jun 19, 2009 at 23:28, Brion Vibber <brion@wikimedia.org> wrote:
Jaska Zedlik wrote:
Hi!
Several different apostrophe characters exist. Let's consider two of them: U+0027 and U+2019. They have the same meaning, and both are acceptable apostrophes in English, for instance. The problem is that MediaWiki's internal search distinguishes between these two apostrophes, so words containing U+2019 can't be found by a query containing U+0027, and vice versa.
Probably what we should be doing in this area is running text through Unicode compatibility composition normalization as well as some other character folding for punctuation forms where necessary. (UtfNormal::toNFKC() will merge things like full-width Roman characters but won't merge these related-but-not-quite-the-same punctuation forms.)
-- brion
As I understand it, this is not a case for Unicode compatibility composition, since these are two different characters (U+2019 is even defined as Right Single Quotation Mark), but in some languages (certainly not all) they can have identical meaning. Because the characters are different, I'm afraid they are not covered by the Unicode normalization process, and we would have to handle them with the functions available in the language class.
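
To illustrate the point, here is a rough sketch in Python rather than MediaWiki's PHP. The names fold_for_search and APOSTROPHE_FOLD are hypothetical and are not part of MediaWiki; which characters get folded is an assumption that would have to be decided per language. It shows that NFKC merges a full-width Latin letter but leaves U+2019 alone, so an extra folding step is needed:

```python
import unicodedata

# Hypothetical folding table: which characters count as an apostrophe
# is a per-language decision, not something Unicode normalization defines.
APOSTROPHE_FOLD = {0x2019: "'"}  # U+2019 -> U+0027


def fold_for_search(text: str) -> str:
    """Apply NFKC, then fold apostrophe variants to U+0027."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(APOSTROPHE_FOLD)


curly = "don\u2019t"

# NFKC merges full-width Latin letters into their ASCII forms...
print(unicodedata.normalize("NFKC", "\uff21") == "A")

# ...but leaves U+2019 untouched, so the two spellings still differ.
print(unicodedata.normalize("NFKC", curly) == "don't")

# With the extra folding step, both spellings index to the same string.
print(fold_for_search(curly) == fold_for_search("don't"))
```

The same folding would have to be applied both when indexing the text and when processing the search query; otherwise the two forms would still not match.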
zedlik