Jaska Zedlik wrote:
Hi!
There are several different apostrophe signs. Let's consider two of them: U+0027 and U+2019. They have the same meaning, and both are acceptable apostrophes for the English language, for instance. The problem is that the MediaWiki internal search distinguishes between these two apostrophes, so words containing U+2019 can't be found with a query containing U+0027 and vice versa.
Probably what we should be doing in this area is running text through Unicode compatibility composition normalization as well as some other character folding for punctuation forms where necessary. (UtfNormal::toNFKC() will merge things like full-width Roman characters but won't merge these related-but-not-quite-the-same punctuation forms.)
-- brion
MediaWiki uses a search index for the internal search, and the index is updated every time an article is saved. I have found that if the stripForSearch() function in the language class is overridden with a new function which replaces U+2019 with U+0027 for the search index, the internal search begins to work properly regardless of which apostrophe was provided in the search query, U+0027 or U+2019. Of course, the context is not highlighted if the apostrophes differ between the query and the result, but the search returns what is really needed.
The question is, if we override the stripForSearch() function in the language class in such a way, won't this cause any problems?
The code of the override function is the following:
function stripForSearch( $string ) {
	// Replace U+2019 (UTF-8 bytes E2 80 99) with U+0027 before indexing
	$s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
	return parent::stripForSearch( $s );
}
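For trying the idea outside MediaWiki, here is a minimal standalone sketch of the same folding step. The helper name foldApostrophes() is hypothetical and not part of MediaWiki; it only assumes that both the indexed text and the search query pass through the same function.

```php
<?php
// Hypothetical helper (not a MediaWiki function): folds the typographic
// apostrophe U+2019 into the typewriter apostrophe U+0027.
function foldApostrophes( $string ) {
	// "\xE2\x80\x99" is the UTF-8 byte sequence for U+2019.
	return str_replace( "\xE2\x80\x99", "'", $string );
}

// The indexed text uses U+2019, the query uses U+0027.
$indexed = foldApostrophes( "don\xE2\x80\x99t" );
$query = foldApostrophes( "don't" );

// After folding, the two forms compare as equal.
var_dump( $indexed === $query );
```

Because the replacement is applied to both sides, it does not matter which apostrophe the article or the query originally contained.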
We want to introduce such a change for Belarusian, but I think the Ukrainian language may experience the same problem with the different apostrophes: U+0027 is not a valid apostrophe there either, yet only U+0027 (the typewriter apostrophe) is available on the majority of Belarusian and Ukrainian keyboard layouts.
Thanks, zedlik
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l