[Mediawiki-l] Does Lucene-Search Support Japanese?

Ross Xu rossxunix at yahoo.ca
Fri Oct 1 14:01:13 UTC 2010


It's very kind of you, Robert!
I didn't know the lsearch-global.conf could be so powerful, and so complicated.
Mine is much simpler. Here it is:
----------------------------------------------------
################################################
# Global search cluster layout configuration
################################################
[Database]
wikidb3 : (single) (spell,4,2) (language,ja)
 
[Search-Group]
server08 : wikidb3 wikidb3.spell
 
[Index]
server08 : *
 
[Index-Path]
<default> : /search
 
[OAI]
<default> : http://server08/products/documentation/wiki3/index.php
 
[Namespace-Boost]
<default> : (0,2) (1,0.5)
 
[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15
----------------------------------------------------
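
A note on checking a file like this one: the advice in the quoted messages below is to list the index parts explicitly in [Search-Group] and to leave out the .hl part, so that MediaWiki rather than the lucene backend builds the snippets. The Python sketch below is not part of lucene-search. The helper names parse_sections and highlight_entries are hypothetical, and the parsing rules (a line consisting only of "[Name]" starts a section, every other line is "key : value") are assumptions based only on the file above.

```python
import re

# Shortened copy of the configuration above, used only as sample input.
SAMPLE = """
# Global search cluster layout configuration
[Database]
wikidb3 : (single) (spell,4,2) (language,ja)

[Search-Group]
server08 : wikidb3 wikidb3.spell

[Index]
server08 : *

[OAI]
<default> : http://server08/products/documentation/wiki3/index.php

[Namespace-Prefix]
all : <all>
[0] : 0
"""


def parse_sections(text):
    """Return {section name: [(key, value), ...]} for an lsearch-global.conf style text.

    Assumption: a section header is a line that is nothing but "[Name]".
    Lines such as "[0] : 0" therefore count as entries, not headers.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        header = re.fullmatch(r"\[([^\]]+)\]", line)
        if header:
            current = header.group(1)
            sections.setdefault(current, [])
        elif current is not None and ":" in line:
            # Split on the first colon only, so URLs in the value stay intact.
            key, _, value = line.partition(":")
            sections[current].append((key.strip(), value.strip()))
    return sections


def highlight_entries(sections):
    """List (host, part) pairs in [Search-Group] that would include the .hl index.

    Per the thread, a wildcard pulls in every index part, including .hl.
    """
    found = []
    for host, value in sections.get("Search-Group", []):
        for part in value.split():
            if "*" in part or part.endswith(".hl"):
                found.append((host, part))
    return found


if __name__ == "__main__":
    conf = parse_sections(SAMPLE)
    print("Search-Group:", conf["Search-Group"])
    print("Entries that include highlighting:", highlight_entries(conf))
```

For the file above, highlight_entries returns an empty list, because [Search-Group] names only wikidb3 and wikidb3.spell. The wildcard in [Index] is not checked, since the advice in the thread concerns [Search-Group] only.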
 
Thanks again,
Ross Xu
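
P.S. On the two-character minimum for Japanese queries described in the quoted messages below: one way to picture it is an index that stores CJK text as overlapping two-character tokens (bigrams), which is how Lucene's classic CJK analysis works. Whether lucene-search 2.1 does exactly this is an assumption; the thread only says that single characters cannot be searched "because this is how text analysis currently works". The names is_cjk, cjk_tokens and matches are hypothetical, and the character ranges are a rough approximation.

```python
def is_cjk(ch):
    """Rough test for kana and common CJK ideographs (not exhaustive)."""
    code = ord(ch)
    return 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF


def cjk_tokens(text):
    """Tokenize only the CJK runs of `text` into overlapping bigrams.

    A run of a single character is kept as a one-character token.
    Non-CJK text is ignored in this sketch.
    """
    tokens, run = [], []
    for ch in text + " ":  # trailing space flushes the last run
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(a + b for a, b in zip(run, run[1:]))
        run = []
    return tokens


def matches(query, document):
    """True if every token of the query is also a token of the document."""
    indexed = set(cjk_tokens(document))
    query_tokens = cjk_tokens(query)
    return bool(query_tokens) and all(t in indexed for t in query_tokens)


if __name__ == "__main__":
    doc = "SysStringLen 関数を使用"
    print(cjk_tokens(doc))
    print(matches("関数", doc))  # two-character query
    print(matches("数", doc))    # single-character query
```

Under this assumption the document is indexed as 関数, 数を, を使, 使用. A query for 関数 finds its bigram in the index, while a query for the single character 数 produces a one-character token that was never indexed, so nothing matches.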

--- On Fri, 10/1/10, Robert Stojnic <rainmansr at gmail.com> wrote:


From: Robert Stojnic <rainmansr at gmail.com>
Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
To: "MediaWiki announcements and site admin list" <mediawiki-l at lists.wikimedia.org>
Received: Friday, October 1, 2010, 2:15 AM



Hi Ross,

Wikipedia is using lucene-search. You can view the configuration we are 
using here:
http://noc.wikimedia.org/conf/lsearch-global-2.1.conf

If you send me your lsearch-global.conf file I can edit it so that it 
works for you, if my instructions weren't clear.

Cheers, Robert

On 30/09/10 18:53, Ross Xu wrote:
> Hi Robert,
> Thanks a lot for your kindness, and your reply!
>   
> But I don't exactly know what the "remedy" is. Did you mean the "[Search-Group]" section in the lsearch-global.conf file? I tried the setup for the [Search-Group] section, but there was no difference. I can't find where to set up [SearchGroup] under "MediaWiki" instead of the lucene backend.
>   
> BTW, do you know which search tool Wikipedia is using? I can search for Japanese text on Wikipedia without any problem.
>   
> Thanks again,
> Ross Xu
>
>
> --- On Wed, 9/29/10, Robert Stojnic<rainmansr at gmail.com>  wrote:
>
>
> From: Robert Stojnic<rainmansr at gmail.com>
> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
> To: "MediaWiki announcements and site admin list"<mediawiki-l at lists.wikimedia.org>
> Received: Wednesday, September 29, 2010, 8:09 PM
>
>
>
> Hi Ross,
>
> 1) The character doubling in CJK languages is a known bug, and the only
> remedy right now is to have MediaWiki (instead of the lucene backend) do
> the snippet extraction. To do this, make your setup look like this:
>
> [Search-Group]
> your_host : wikidb wikidb.spell
>
> (instead of using the wildcard character "*"). Note that wikidb.hl, which
> contains the snippet highlighting information, is left out.
>
> 2) There is currently no way of specifying multiple languages. In the
> case of english/japanese it might be easy to distinguish them, but in
> the general case it is quite difficult. As a result, this hasn't been
> implemented.
>
> As for it being 5 months, no one has been working on improving the search
> for more than a year. There has been the occasional bugfixing and
> maintenance, but no major features like the ones you're suggesting.
> No-one is contracted or paid to do it, and I don't have any more free
> time to put into it. So, I'm afraid that you are on your own on this,
> and if you want things moving for the japanese search, you will have to
> do it yourself.
>
> Cheers, Robert
>
> On 28/09/10 20:30, Ross Xu wrote:
>    
>> Hi there,
>> It's been 5 months, but I am still having this problem.
>> I have upgraded my MW to 1.16.0, and upgraded MWSearch to the version for 1.16.0. Lucene-Search is still at the newest version, 2.1.3.
>>     
>> If I set (language,ja) in the [Database] section in the lsearch-global.conf file, run ./build, and search for "Checkers:BSTR.FUNC.LEN", I get:
>> ##########################
>> Checkers:BSTR.FUNC.LEN
>> Attempt to get length of non-BSTR string using SysStringLen or SysStringByteLen ... 1 void bstr_len() 2 wchar_t a L"abc";3 int l SysStringLen(a ... 530 B (82 words) - 13:42, 30 September 2009
>>
>> Checkers:BSTR.FUNC.LEN/ja
>> SysStringLen 関数数ままたたは SysStringByteLen 関数数をを使使用用しして、BSTR 以外外のの文文字字列列のの長長ささをを取取得得ししよ ... 1 void bstr_len() 2 wchar_t a L"abc";3 int l ... 669 B (160 words) - 13:42, 30 September 2009
>> View (previous 20 | next 20) (20 | 50 | 100 | 250 | 500)
>> ##########################
>> As you can see, I get two pages in the search results: one is the English version and the other is the Japanese version, and each Japanese character is repeated. But I get nothing if I search for any Japanese characters (e.g. "関数").
>>     
>> If I set (language,en) in the [Database] section in the lsearch-global.conf file, run ./build, and search for "Checkers:BSTR.FUNC.LEN", I get the same results (two pages - one English version, and one Japanese version), but Japanese characters are NOT repeated. However, I still can't search for any Japanese characters (e.g. "関数").
>>     
>> It seems the indexing is correct with the setting of (language,en), but the MWSearch just can't FETCH the results from the index. Is it a problem with MWSearch?
>>     
>> BTW, what's the syntax for multiple language settings in the [Database] section?
>> Is it ...
>> [Database]
>> wikidb : (single) (spell,4,2) (language,en)
>> wikidb : (single) (spell,4,2) (language,ja)
>>     
>> or:
>> [Database]
>> wikidb : (single) (spell,4,2) (language,en) (language,ja)
>>     
>> or just:
>> [Database]
>> wikidb : (single) (spell,4,2) (language,en)
>>     
>> Many thanks for any idea!
>> Ross Xu
>>
>>
>> --- On Fri, 4/16/10, Ross Xu<rossxunix at yahoo.ca>   wrote:
>>
>>
>> From: Ross Xu<rossxunix at yahoo.ca>
>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>> To: "MediaWiki announcements and site admin list"<mediawiki-l at lists.wikimedia.org>
>> Received: Friday, April 16, 2010, 4:56 AM
>>
>>
>> Thanks Robert. It's nice to know.
>>     
>> Is there any plan to support database prefixes in the future?
>> I am just curious why it does not repeat English words or letters, and only repeats Japanese characters. If there were no Japanese characters in my wiki sites, my configurations would work perfectly.
>>     
>> BTW, what does it look like in your "two lines in the [Database] section instead of one"?
>> Thanks again,
>> Ross
>>
>>
>> --- On Thu, 4/15/10, Robert Stojnic<rainmansr at gmail.com>   wrote:
>>
>>
>> From: Robert Stojnic<rainmansr at gmail.com>
>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>> To: "MediaWiki announcements and site admin list"<mediawiki-l at lists.wikimedia.org>
>> Received: Thursday, April 15, 2010, 6:34 AM
>>
>>
>>
>> Lucene-search currently doesn't support database prefixes, and I imagine
>> this is why your hacked-up setup doesn't work. The extension is designed
>> so that all of the wikis can be indexed and searched by a single
>> daemon; you just need to merge the appropriate sections (i.e. have two
>> lines in the [Database] section instead of one). A way to go for you
>> might be to separately dump both wikis and then import them with
>> different names, but lucene-search is not designed to run this kind of
>> setup out of the box.
>>
>> r.
>>
>> Ross Xu wrote:
>>
>>      
>>> Hi there,
>>> I don't think it's related, but this wiki site is the second one on the same machine.
>>>     
>>> The two wiki sites share the same MySQL database with different prefixes, and share the same MediaWiki source using symbolic links.
>>> The first one (called wiki1) is working well with Lucene-Search 2.1/MWSearch and searches Japanese without any problem. I even use (language,en) instead of (language,ja) in the global conf file.
>>> The second one (called wiki2) is using Lucene-Search 2.1/MWSearch on different ports (Search.port=8124, and Index.port=8322). It works well for any English keywords, but when I search for Japanese characters, each Japanese character is repeated in the snippets. And I have to use (language,ja); otherwise, I can't get any Japanese search results at all.
>>>     
>>> Even after I kill all lsearchd daemons, re-do the index (build) for wiki2, and restart lsearchd for wiki2, it's still the same thing.
>>> Any idea is appreciated,
>>> Ross
>>>
>>>
>>> --- On Wed, 4/14/10, Ross Xu<rossxunix at yahoo.ca>   wrote:
>>>
>>>
>>> From: Ross Xu<rossxunix at yahoo.ca>
>>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>>> To: "MediaWiki announcements and site admin list"<mediawiki-l at lists.wikimedia.org>
>>> Received: Wednesday, April 14, 2010, 11:52 PM
>>>
>>>
>>> Hi Robert,
>>> Thanks a bunch for your prompt reply.
>>> I have changed the global conf file, and re-run the build, but Japanese characters are still repeated. Here is what my lsearch-global.conf file looks like:
>>> ------------------
>>> [Database]
>>> wikidb: (single) (spell,4,2) (language,ja)
>>>     
>>> [Search-Group]
>>> <my hostname>   : wikidb wikidb.spell
>>>     
>>> [Index]
>>> <my hostname>   : *
>>>     
>>> [Index-Path]
>>> <default>   : /search
>>> ...
>>> ------------------
>>>
>>> I know I can't search for a single character, but I can ONLY search for 2 characters now. As mentioned earlier, because of the repetition, a search for more than 2 characters (e.g. 3 characters) does not match anything.
>>> Any more ideas?
>>> Thanks again,
>>> Ross
>>>     
>>>
>>> --- On Wed, 4/14/10, Robert Stojnic<rainmansr at gmail.com>   wrote:
>>>
>>>
>>> From: Robert Stojnic<rainmansr at gmail.com>
>>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>>> To: "MediaWiki announcements and site admin list"<mediawiki-l at lists.wikimedia.org>
>>> Received: Wednesday, April 14, 2010, 11:13 PM
>>>
>>>
>>>
>>> The repetition of characters is a known bug with highlighting, so you
>>> will need to disable it. As for searching, as I said you are stuck with
>>> only being able to search for two or more characters, not single
>>> characters because this is how text analysis currently works.
>>>
>>> To turn off lucene highlighting, use the following in your global conf file:
>>>
>>> [Search-Group]
>>> <put your host name here>: wikidb wikidb.spell
>>>
>>> Cheers, r.
>>>
>>> Ross Xu wrote:
>>>
>>>
>>>        
>>>> Thank you, Robert.
>>>> I found the problem ...
>>>> Each Japanese character gets repeated in the search results. For example, if I search for "関数", I get this in the result list:
>>>> ----------------------------------------------
>>>> Showing below results 1 - 20 of 111
>>>> View (previous 20) (next 20) (20 | 50 | 100 | 250 | 500)
>>>>
>>>>
>>>>
>>>> Checkers:BSTR.FUNC.LEN/ja
>>>> SysStringLen 関数数ままたたは SysStringByteLen 関数数をを使使用用しして、BSTR 以外外のの文文字字列列のの長長ささをを取取得得ししよよううととししてていいまます。 これれららのの関関数数のの唯唯一一のの引引数数は、BSTR、CComBSTR、 …
>>>> 669 B (160 words) - 13:42, 30 September 2009
>>>>       
>>>>
>>>> Checkers:VOIDRET/ja
>>>> void 型のの関関数数ととししてて宣宣言言さされれたた関関数数がが値値をを返返ししてていいまます。 脆弱弱性性ととリリススク : スタタイイルル関関連連のの問問題題でです。 例 : void A:foo()5 6 return 0; // 値が void 関数数かからら返返さされれる 7 …
>>>> 444 B (91 words) - 20:17, 1 February 2010
>>>> ...
>>>> ----------------------------------------------
>>>>
>>>> It doesn't repeat English letters or words.
>>>> So, if I only search for 2 Japanese characters, I can get them from the result because the 2 characters can still be matched from the repeated characters.
>>>> If I search for more than 2 characters, it can't find anything.
>>>>       
>>>> What causes this problem, and how can I fix it?
>>>> Thanks again,
>>>> Ross
>>>>
>>>> --- On Thu, 4/8/10, Robert Stojnic<rainmansr at gmail.com>   wrote:
>>>>
>>>>
>>>> From: Robert Stojnic<rainmansr at gmail.com>
>>>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>>>> To: "MediaWiki announcements and site admin list"<mediawiki-l at lists.wikimedia.org>
>>>> Received: Thursday, April 8, 2010, 2:15 AM
>>>>
>>>>
>>>>
>>>> The config file looks good. Further debugging steps:
>>>>
>>>> 1) make sure the search you are making is showing up in the
>>>> lucene-search log; if you just start the daemon with ./lsearchd, that
>>>> would be the console
>>>> 2) make sure you have the MWSearch version that matches your MediaWiki version
>>>> 3) note that you cannot search for a single japanese character, but only
>>>> for 2 or more
>>>>
>>>> r.
>>>>
>>>> Ross Xu wrote:
>>>>
>>>>
>>>>
>>>>          
>>>>> Thanks for your reply, Robert.
>>>>> But I did rebuild the index (./build) after changing (language,en) to (language,ja). It still doesn't work. There is no problem with searching any English keywords.
>>>>>
>>>>> You mentioned "add (language,ja)". Did you mean to ADD (language,ja) besides (language,en)?
>>>>> The entry is like this in my lsearch-global.conf file:
>>>>> [Database]
>>>>> wikidb : (single) (spell,4,2) (language,ja)
>>>>>       
>>>>> Any more ideas?
>>>>> Thanks again,
>>>>> Ross
>>>>>
>>>>> --- On Wed, 4/7/10, Robert Stojnic<rainmansr at gmail.com>   wrote:
>>>>>
>>>>>
>>>>> From: Robert Stojnic<rainmansr at gmail.com>
>>>>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>>>>> To: "MediaWiki announcements and site admin list"<mediawiki-l at lists.wikimedia.org>
>>>>> Received: Wednesday, April 7, 2010, 6:50 PM
>>>>>
>>>>>
>>>>>
>>>>> There is rather limited support, but it does work (e.g. see
>>>>> ja.wikipedia.org). Don't forget to rebuild your index (./build) after
>>>>> you add (language,ja).
>>>>>
>>>>> r.
>>>>>
>>>>> Ross Xu wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> Hi there,
>>>>>> I am using Lucene-Search 2.1/MWSearch for my MediaWiki 1.15.1.
>>>>>> It's working fine, but it can't search any Japanese characters.
>>>>>> I have tried (language,ja) in the lsearch-global.conf file, but it doesn't seem to make any difference.
>>>>>> Any idea would be appreciated,
>>>>>> Ross
>>>>>>
>>>>>>


