I am trying to understand the language-related code for Chinese. I have found that
stripForSearch, as defined in the utf8 and zh* language files, looks like this:
$out = preg_replace( "/([\xc0-\xff][\x80-\xbf]*)/e", "'U8' . bin2hex( strtr( \"\$1\", \$wikiSearchChars ) )", $string );
My question is:
Why is the strtr placed here, and not like this:
$out = preg_replace( "/([\xc0-\xff][\x80-\xbf]*)/e", "'U8' . bin2hex( \"\$1\" )", strtr( $string, $wikiSearchChars ) );
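To make the difference between the two placements concrete, here is a minimal sketch. It is not the MediaWiki code: $searchChars is a hypothetical three-entry stand-in for $wikiSearchChars, the function names foldInsideMatch and foldBeforeMatch are made up for illustration, and preg_replace_callback is used in place of the /e modifier, which was removed in PHP 7.

```php
<?php
// Hypothetical miniature of $wikiSearchChars: case folding plus accent folding.
$searchChars = array(
    "A"        => "a",
    "\xc3\x81" => "a",   // Á
    "\xc3\xa1" => "a",   // á
);

// Placement 1: strtr runs on each matched multibyte sequence only,
// and the folded result is then hex-encoded.
function foldInsideMatch( $string, array $table ) {
    return preg_replace_callback(
        '/[\xc0-\xff][\x80-\xbf]*/',
        function ( $m ) use ( $table ) {
            return 'U8' . bin2hex( strtr( $m[0], $table ) );
        },
        $string
    );
}

// Placement 2: strtr runs on the whole string first; only what is
// still multibyte afterwards gets hex-encoded.
function foldBeforeMatch( $string, array $table ) {
    return preg_replace_callback(
        '/[\xc0-\xff][\x80-\xbf]*/',
        function ( $m ) {
            return 'U8' . bin2hex( $m[0] );
        },
        strtr( $string, $table )
    );
}

$input = "\xc3\x81A\xe4\xb8\xad";   // "ÁA中"

echo foldInsideMatch( $input, $searchChars ), "\n";   // U861AU8e4b8ad
echo foldBeforeMatch( $input, $searchChars ), "\n";   // aaU8e4b8ad
```

With strtr inside the replacement, only the matched multibyte sequences are folded, and the folded result is then hex-encoded, so Á ends up as U861 and the ASCII A is left alone. With strtr applied to the whole string first, both become a plain a, which is what a search for a needs.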
I think I am right:
I have a small patch for language/languageUtf8.php.
It allows a search for a to also match á, à, Á, etc.
1. Patch languageUtf8.php:
    $wikiUpperChars = $wgMemc->get( $key1 = "$wgDBname:utf8:upper" );
    $wikiLowerChars = $wgMemc->get( $key2 = "$wgDBname:utf8:lower" );
    $wikiSearchChars = $wgMemc->get( $key3 = "$wgDBname:utf8:search" );

    # Also reload when only the search table is missing from the cache
    if( empty( $wikiUpperChars ) || empty( $wikiLowerChars ) || empty( $wikiSearchChars ) ) {
        require_once( "includes/Utf8Case.php" );
        $wgMemc->set( $key1, $wikiUpperChars );
        $wgMemc->set( $key2, $wikiLowerChars );
        $wgMemc->set( $key3, $wikiSearchChars );
    }
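One thing to watch in a fallback like this is a cache that was filled before the patch: it may already hold the upper and lower tables but not the search table. The following is a minimal sketch of a lookup that reloads when any of the three is missing. ArrayCacheSketch is a hypothetical in-memory stand-in for $wgMemc (it assumes get() returns false on a miss), and loadTablesSketch is a made-up placeholder for the require_once of includes/Utf8Case.php.

```php
<?php
// Hypothetical stand-in for $wgMemc; only get() and set() are modelled.
class ArrayCacheSketch {
    private $data = array();

    public function get( $key ) {
        // Assumption: a miss is reported as false.
        return isset( $this->data[$key] ) ? $this->data[$key] : false;
    }

    public function set( $key, $value ) {
        $this->data[$key] = $value;
    }
}

// Hypothetical placeholder for loading the tables from Utf8Case.php.
function loadTablesSketch() {
    return array(
        'upper'  => array( "a" => "A" ),
        'lower'  => array( "A" => "a" ),
        'search' => array( "A" => "a", "\xc3\xa1" => "a" ),
    );
}

// Fetch all three tables; reload and re-cache them if any one is missing.
function getTablesSketch( $cache, $dbName ) {
    $tables = array();
    $missing = false;
    foreach ( array( 'upper', 'lower', 'search' ) as $name ) {
        $tables[$name] = $cache->get( "$dbName:utf8:$name" );
        if ( empty( $tables[$name] ) ) {
            $missing = true;
        }
    }
    if ( $missing ) {
        $tables = loadTablesSketch();
        foreach ( $tables as $name => $table ) {
            $cache->set( "$dbName:utf8:$name", $table );
        }
    }
    return $tables;
}

// A cache left over from before the patch: no search table yet.
$cache = new ArrayCacheSketch();
$cache->set( "wikidb:utf8:upper", array( "a" => "A" ) );
$cache->set( "wikidb:utf8:lower", array( "A" => "a" ) );

$tables = getTablesSketch( $cache, "wikidb" );
var_dump( isset( $tables['search']["\xc3\xa1"] ) );   // bool(true)
```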
    function stripForSearch( $string ) {
        # MySQL fulltext index doesn't grok utf-8, so we
        # need to fold cases and convert to hex

        # In Language:: it just returns lowercase, maybe
        # all strtolower on stripped output or argument
        # should be removed and all stripForSearch
        # methods adjusted to that.

        wfProfileIn( "LanguageUtf8::stripForSearch" );
        /* if( function_exists( 'mb_strtolower' ) ) {
            $out = preg_replace(
                "/([\xc0-\xff][\x80-\xbf]*)/e",
                "'U8' . bin2hex( \"\$1\" )",
                mb_strtolower( $string ) );
        } else { */
        global $wikiSearchChars;
        $out = preg_replace(
            "/([\xc0-\xff][\x80-\xbf]*)/e",
            "' U8' . bin2hex( \"\$1\" )",
            strtr( $string, $wikiSearchChars ) );
        # }
        wfProfileOut( "LanguageUtf8::stripForSearch" );
        return $out;
    }
In fact, the mb_str* functions cannot be used here any more: they only change case, which is not sufficient for this.
2. Add to includes/Utf8Case.php:
$wikiSearchChars, which is identical to $wikiLowerChars except for the following entries:
    # À Á Â Ã Ä Å Æ Ç
    "\xc3\x80"=>"a", "\xc3\x81"=>"a", "\xc3\x82"=>"a", "\xc3\x83"=>"a",
    "\xc3\x84"=>"a", "\xc3\x85"=>"a", "\xc3\x86"=>"ae", "\xc3\x87"=>"c",
    # È É Ê Ë Ì Í Î Ï
    "\xc3\x88"=>"e", "\xc3\x89"=>"e", "\xc3\x8a"=>"e", "\xc3\x8b"=>"e",
    "\xc3\x8c"=>"i", "\xc3\x8d"=>"i", "\xc3\x8e"=>"i", "\xc3\x8f"=>"i",
    # Ð Ñ Ò Ó Ô Õ Ö ×
    "\xc3\x90"=>"d", "\xc3\x91"=>"n", "\xc3\x92"=>"o", "\xc3\x93"=>"o",
    "\xc3\x94"=>"o", "\xc3\x95"=>"o", "\xc3\x96"=>"o", "\xc3\x97"=>"x",
    # Ø Ù Ú Û Ü Ý Þ ß
    "\xc3\x98"=>"o", "\xc3\x99"=>"u", "\xc3\x9a"=>"u", "\xc3\x9b"=>"u",
    "\xc3\x9c"=>"u", "\xc3\x9d"=>"y", "\xc3\x9e"=>"p", "\xc3\x9f"=>"b",
    # à á â ã ä å æ ç
    "\xc3\xa0"=>"a", "\xc3\xa1"=>"a", "\xc3\xa2"=>"a", "\xc3\xa3"=>"a",
    "\xc3\xa4"=>"a", "\xc3\xa5"=>"a", "\xc3\xa6"=>"ae", "\xc3\xa7"=>"c",
    # è é ê ë ì í î ï
    "\xc3\xa8"=>"e", "\xc3\xa9"=>"e", "\xc3\xaa"=>"e", "\xc3\xab"=>"e",
    "\xc3\xac"=>"i", "\xc3\xad"=>"i", "\xc3\xae"=>"i", "\xc3\xaf"=>"i",
    # ð ñ ò ó ô õ ö (÷, "\xc3\xb7", is not mapped)
    "\xc3\xb0"=>"d", "\xc3\xb1"=>"n", "\xc3\xb2"=>"o", "\xc3\xb3"=>"o",
    "\xc3\xb4"=>"o", "\xc3\xb5"=>"o", "\xc3\xb6"=>"o",
    # ø ù ú û ü ý þ ÿ
    "\xc3\xb8"=>"o", "\xc3\xb9"=>"u", "\xc3\xba"=>"u", "\xc3\xbb"=>"u",
    "\xc3\xbc"=>"u", "\xc3\xbd"=>"y", "\xc3\xbe"=>"p", "\xc3\xbf"=>"y",
I do not think it is a good idea to reuse $wikiLowerChars for this, because the name would then be misleading, so I added a new table that is used only for search. I am not sure about some of the conversions, and more entries may need to be added.
I have tested this together with Chinese, and it works.
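Since $wikiSearchChars is described as $wikiLowerChars plus a set of overrides, one possible way to build it without copying the whole table is to overlay the folding entries on the lower-case table. This is only a sketch: $lowerChars and $foldOverrides are tiny hypothetical stand-ins, not the real tables from includes/Utf8Case.php.

```php
<?php
// Hypothetical stand-in for a few entries of $wikiLowerChars.
$lowerChars = array(
    "A"        => "a",
    "C"        => "c",
    "F"        => "f",
    "\xc3\x89" => "\xc3\xa9",   // É => é
    "\xc3\x86" => "\xc3\xa6",   // Æ => æ
);

// Hypothetical search-only overrides: fold accents away entirely.
$foldOverrides = array(
    "\xc3\x89" => "e",    // É
    "\xc3\xa9" => "e",    // é
    "\xc3\x86" => "ae",   // Æ
    "\xc3\xa6" => "ae",   // æ
);

// Array union: for keys present in both, the left operand wins,
// so the overrides replace the plain lower-casing entries.
$searchChars = $foldOverrides + $lowerChars;

$input = "CAF\xc3\x89 \xc3\x86";   // "CAFÉ Æ"

echo strtr( $input, $lowerChars ), "\n";    // café æ
echo strtr( $input, $searchChars ), "\n";   // cafe ae
```

strtr with an array argument accepts replacements of a different length than the key, which is why a two-byte Æ can become the two letters ae.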
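A check of this kind could look like the following sketch. stripForSearchSketch is a hypothetical stand-alone approximation of the patched method (no profiling calls, and preg_replace_callback instead of /e), and the three-entry table is only a stand-in for $wikiSearchChars.

```php
<?php
// Fold with the table first, then hex-encode what is still multibyte.
function stripForSearchSketch( $string, array $searchChars ) {
    return preg_replace_callback(
        '/[\xc0-\xff][\x80-\xbf]*/',
        function ( $m ) {
            return ' U8' . bin2hex( $m[0] );
        },
        strtr( $string, $searchChars )
    );
}

// Hypothetical stand-in for $wikiSearchChars.
$table = array( "\xc3\x81" => "a", "\xc3\xa0" => "a", "\xc3\xa1" => "a" );

// Indexed text: "árbol 中文"; query typed without the accent.
$indexed = stripForSearchSketch( "\xc3\xa1rbol \xe4\xb8\xad\xe6\x96\x87", $table );
$query   = stripForSearchSketch( "arbol", $table );

// Split the indexed form into words and look for the query among them.
$words = preg_split( '/\s+/', $indexed, -1, PREG_SPLIT_NO_EMPTY );
print_r( $words );   // arbol, U8e4b8ad, U8e69687
var_dump( in_array( $query, $words, true ) );   // bool(true)
```

The accented word and the unaccented query reduce to the same token, while each Chinese character still becomes its own hex-encoded word.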
On 6/29/06, kent sin kentsin@gmail.com wrote:
-- Sin Hang Kin.
mediawiki-l@lists.wikimedia.org