I think I am right:
I have a small patch to language/languageUtf8.php.
It allows a search for "a" to also match á, à, Á, etc.
1. Patch languageUtf8.php:
$wikiUpperChars = $wgMemc->get( $key1 = "$wgDBname:utf8:upper" );
$wikiLowerChars = $wgMemc->get( $key2 = "$wgDBname:utf8:lower" );
$wikiSearchChars = $wgMemc->get( $key3 = "$wgDBname:utf8:search" );
# Reload all three tables if any one of them is missing from the cache
if( empty( $wikiUpperChars ) || empty( $wikiLowerChars ) || empty( $wikiSearchChars ) ) {
    require_once( "includes/Utf8Case.php" );
    $wgMemc->set( $key1, $wikiUpperChars );
    $wgMemc->set( $key2, $wikiLowerChars );
    $wgMemc->set( $key3, $wikiSearchChars );
}
function stripForSearch( $string ) {
# MySQL fulltext index doesn't grok utf-8, so we
# need to fold cases and convert to hex
# In Language:: it just returns lowercase, maybe
# all strtolower on stripped output or argument
# should be removed and all stripForSearch
# methods adjusted to that.
wfProfileIn( "LanguageUtf8::stripForSearch" );
/* if( function_exists( 'mb_strtolower' ) ) {
$out = preg_replace(
"/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
"'U8' . bin2hex( \"$1\" )",
mb_strtolower( $string ) );
} else { */
global $wikiSearchChars;
$out = preg_replace(
"/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
"' U8' . bin2hex( \"$1\" )",
strtr( $string, $wikiSearchChars ) );
# }
wfProfileOut( "LanguageUtf8::stripForSearch" );
return $out;
}
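To make the idea easier to try outside MediaWiki, here is a minimal standalone sketch of the same "fold first, then hex-encode" step. It is not the patch itself: the function name stripForSearchSketch and the four-entry $searchChars table are made up for illustration, and preg_replace_callback is swapped in for the /e modifier, which was removed in PHP 7.

```php
<?php
// Hypothetical excerpt of the search folding table; the real
// $wikiSearchChars would carry the full list from Utf8Case.php.
$searchChars = array(
    "A"        => "a",   // plain ASCII case folding
    "\xc3\x81" => "a",   // Á
    "\xc3\xa0" => "a",   // à
    "\xc3\xa1" => "a",   // á
);

// Fold the whole string through the table, then hex-encode every
// multibyte UTF-8 sequence that is still left (for example Chinese).
function stripForSearchSketch( $string, $table ) {
    $folded = strtr( $string, $table );
    return preg_replace_callback(
        "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
        function ( $m ) {
            // Same ' U8' prefix (with leading space) as in the patch
            return ' U8' . bin2hex( $m[1] );
        },
        $folded
    );
}

echo stripForSearchSketch( "\xc3\x81A\xc3\xa1", $searchChars ), "\n"; // aaa
echo stripForSearchSketch( "\xe4\xb8\xad", $searchChars ), "\n";      //  U8e4b8ad
```

Accented Latin letters listed in the table come out as plain ASCII, while anything not listed (the Chinese character in the second call) still goes through the hex encoding as before.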
In effect, the mb_str* functions can no longer be used here. Lowercasing alone is insufficient.
2. Add to includes/Utf8Case.php
a new table, $wikiSearchChars, which is identical to $wikiLowerChars except for:
"\xc3\x80"=>"a",
"\xc3\x81"=>"a",
"\xc3\x82"=>"a",
"\xc3\x83"=>"a",
"\xc3\x84"=>"a",
"\xc3\x85"=>"a",
"\xc3\x86"=>"ae",
"\xc3\x87"=>"c",
"\xc3\x88"=>"e",
"\xc3\x89"=>"e",
"\xc3\x8a"=>"e",
"\xc3\x8b"=>"e",
"\xc3\x8c"=>"i",
"\xc3\x8d"=>"i",
"\xc3\x8e"=>"i",
"\xc3\x8f"=>"i",
"\xc3\x90"=>"d",
"\xc3\x91"=>"n",
"\xc3\x92"=>"o",
"\xc3\x93"=>"o",
"\xc3\x94"=>"o",
"\xc3\x95"=>"o",
"\xc3\x96"=>"o",
"\xc3\x97"=>"x",
"\xc3\x98"=>"o",
"\xc3\x99"=>"u",
"\xc3\x9a"=>"u",
"\xc3\x9b"=>"u",
"\xc3\x9c"=>"u",
"\xc3\x9d"=>"y",
"\xc3\x9e"=>"p",
"\xc3\x9f"=>"b",
"\xc3\xa0"=>"a",
"\xc3\xa1"=>"a",
"\xc3\xa2"=>"a",
"\xc3\xa3"=>"a",
"\xc3\xa4"=>"a",
"\xc3\xa5"=>"a",
"\xc3\xa6"=>"ae",
"\xc3\xa7"=>"c",
"\xc3\xa8"=>"e",
"\xc3\xa9"=>"e",
"\xc3\xaa"=>"e",
"\xc3\xab"=>"e",
"\xc3\xac"=>"i",
"\xc3\xad"=>"i",
"\xc3\xae"=>"i",
"\xc3\xaf"=>"i",
"\xc3\xb0"=>"d",
"\xc3\xb1"=>"n",
"\xc3\xb2"=>"o",
"\xc3\xb3"=>"o",
"\xc3\xb4"=>"o",
"\xc3\xb5"=>"o",
"\xc3\xb6"=>"o",
"\xc3\xb8"=>"o",
"\xc3\xb9"=>"u",
"\xc3\xba"=>"u",
"\xc3\xbb"=>"u",
"\xc3\xbc"=>"u",
"\xc3\xbd"=>"y",
"\xc3\xbe"=>"p",
"\xc3\xbf"=>"y",
I think it is not good to simply reuse $wikiLowerChars for this, because the
name would then be misleading, so I added a new table just for search. I am
not sure about some of the conversions, and more entries may need to be
added.
I have tested this together with Chinese. It works.
On 6/29/06, kent sin <kentsin(a)gmail.com> wrote:
I am trying to understand the language-related code for Chinese. I have
found that stripForSearch, as defined in utf8 and zh*, looks like this:
$out = preg_replace(
"/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
"'U8' . bin2hex( strtr( \"\$1\", \$wikiSearchChars ) )",
$string );
My question is:
Why is the strtr placed there, and not like this:
$out = preg_replace(
"/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
"'U8' . bin2hex( \"\$1\" )",
strtr( $string, $wikiSearchChars ) );
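The difference between the two placements can be seen with a small standalone sketch. The function names foldInside and foldOutside and the two-entry $table are hypothetical, and preg_replace_callback stands in for the /e modifier (removed in PHP 7). It only illustrates the two snippets above and is not MediaWiki code.

```php
<?php
// Hypothetical two-entry table standing in for $wikiSearchChars.
$table = array(
    "A"        => "a",
    "\xc3\x81" => "a",   // Á
);

// Variant 1: strtr inside the replacement. It only ever sees the
// multibyte sequences the regex matched, and its result is hex-encoded.
function foldInside( $string, $table ) {
    return preg_replace_callback(
        "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
        function ( $m ) use ( $table ) {
            return 'U8' . bin2hex( strtr( $m[1], $table ) );
        },
        $string
    );
}

// Variant 2: strtr outside. The whole string is folded first, so
// folded characters are plain ASCII and no longer match the regex.
function foldOutside( $string, $table ) {
    return preg_replace_callback(
        "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
        function ( $m ) {
            return 'U8' . bin2hex( $m[1] );
        },
        strtr( $string, $table )
    );
}

$input = "A\xc3\x81";                      // "AÁ"
echo foldInside( $input, $table ), "\n";   // AU861
echo foldOutside( $input, $table ), "\n";  // aa
```

With strtr inside, the ASCII "A" is never folded and the folded "a" ends up hex-encoded as U861, so it cannot match a plain "a" typed in a search. With strtr outside, both characters become a plain "a".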
--
Sin Hang Kin.
--
Sin Hang Kin.