Hi Robert,
Thanks a lot for your kindness, and your reply!
But I don't exactly know what the "remedy" is. Did you mean the
"[Search-Group]" section in the lsearch-global.conf file? I tried the setup for
the [Search-Group] section, but there was no difference. I can't find where to set up
[SearchGroup] under "MediaWiki" instead of the lucene backend.
BTW, do you know which search tool Wikipedia is using? I can search Japanese on
Wikipedia without any problem.
Thanks again,
Ross Xu
--- On Wed, 9/29/10, Robert Stojnic<rainmansr(a)gmail.com> wrote:
From: Robert Stojnic<rainmansr(a)gmail.com>
Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
To: "MediaWiki announcements and site admin
list"<mediawiki-l(a)lists.wikimedia.org>
Received: Wednesday, September 29, 2010, 8:09 PM
Hi Ross,
1) The character doubling in CJK languages is a known bug, and the only
remedy right now is to have MediaWiki (instead of the lucene backend) do
the snippet extraction. To do this, make your setup look like this:
[Search-Group]
your_host : wikidb wikidb.spell
(instead of using the wildcard character "*". Note that wikidb.hl, which
contains the snippet highlighting information, is left out.)
2) There is currently no way of specifying multiple languages. In the
case of English/Japanese it might be easy to distinguish them, but in
the general case it is quite difficult. As a result, this hasn't been
implemented.
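
The change described in point 1 can be sketched as a before/after of the
[Search-Group] section in lsearch-global.conf. This is only an illustration:
your_host stands in for the real host name, wikidb is the database name used
in the configs quoted elsewhere in this thread, the section name is written
as [Search-Group] because that is the spelling those configs use, and lines
starting with "#" are assumed to be treated as comments.

```ini
[Search-Group]
# Before: the wildcard lets this host serve every index part,
# including wikidb.hl (the snippet highlighting index).
# your_host : *

# After: list the parts explicitly and leave out wikidb.hl,
# so MediaWiki does the snippet extraction instead.
your_host : wikidb wikidb.spell
```

Restarting lsearchd after the change is presumably needed for it to take
effect; the thread does not say whether ./build has to be rerun as well.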
As for it being 5 months, no one has been working on improving the search for
more than a year. There has been occasional bug fixing and
maintenance, but no major features like the ones you're suggesting.
No one is contracted or paid to do it, and I don't have any more free
time to put into it. So, I'm afraid that you are on your own on this,
and if you want things moving for the Japanese search, you will have to
do it yourself.
Cheers, Robert
On 28/09/10 20:30, Ross Xu wrote:
Hi there,
It's been 5 months, but I am still having this problem.
I have upgraded my MW to 1.16.0, and upgraded MWSearch to the version for 1.16.0. Lucene-Search
is still at the newest version, 2.1.3.
If I set (language,ja) in the [Database] section in the lsearch-global.conf file, run
./build, and search for "Checkers:BSTR.FUNC.LEN", I get:
##########################
Checkers:BSTR.FUNC.LEN
Attempt to get length of non-BSTR string using SysStringLen or SysStringByteLen ... 1
void bstr_len() 2 wchar_t a L"abc";3 int l SysStringLen(a ... 530 B (82 words) -
13:42, 30 September 2009
Checkers:BSTR.FUNC.LEN/ja
SysStringLen 関数数ままたたは SysStringByteLen 関数数をを使使用用しして、BSTR 以外外のの文文字字列列のの長長ささをを取取得得ししよ ... 1
void bstr_len() 2 wchar_t a L"abc";3 int l ... 669 B (160 words) - 13:42, 30
September 2009
View (previous 20 | next 20) (20 | 50 | 100 | 250 | 500)
##########################
As you can see, I get two pages/items in the search results: one is the English version, and
the other is the Japanese version, in which each Japanese character is repeated. But I get nothing
if I search for any Japanese characters (e.g. "関数").
If I set (language,en) in the [Database] section in the lsearch-global.conf file, run
./build, and search for "Checkers:BSTR.FUNC.LEN", I get the same results (two
pages - one English version, and one Japanese version), but Japanese characters are NOT
repeated. However, I still can't search for any Japanese characters (e.g.
"関数").
It seems the indexing is correct with the setting of (language,en), but MWSearch just
can't FETCH the results from the index. Is it a problem with MWSearch?
BTW, what's the syntax for multiple language settings in the [Database] section?
Is it ...
[Database]
wikidb : (single) (spell,4,2) (language,en)
wikidb : (single) (spell,4,2) (language,ja)
or:
[Database]
wikidb : (single) (spell,4,2) (language,en) (language,ja)
or just:
[Database]
wikidb : (single) (spell,4,2) (language,en)
Many thanks for any ideas!
Ross Xu
--- On Fri, 4/16/10, Ross Xu<rossxunix(a)yahoo.ca> wrote:
From: Ross Xu<rossxunix(a)yahoo.ca>
Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
To: "MediaWiki announcements and site admin
list"<mediawiki-l(a)lists.wikimedia.org>
Received: Friday, April 16, 2010, 4:56 AM
Thanks Robert. It's nice to know.
Is there any plan to support database prefixes in the future?
I am just curious why it does not repeat English words or letters, and only repeats
Japanese characters. If there were no Japanese characters in my wiki sites, my
configurations would work perfectly.
BTW, what would your "two lines in the [Database] section instead of
one" look like?
Thanks again,
Ross
--- On Thu, 4/15/10, Robert Stojnic<rainmansr(a)gmail.com> wrote:
From: Robert Stojnic<rainmansr(a)gmail.com>
Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
To: "MediaWiki announcements and site admin
list"<mediawiki-l(a)lists.wikimedia.org>
Received: Thursday, April 15, 2010, 6:34 AM
Lucene-search currently doesn't support database prefixes and I imagine
this is why your hacked-up setup doesn't work. The extension is designed
so that all of the wikis can be indexed and searched by a single
daemon; you just need to merge the appropriate sections (i.e. have two
lines in the [Database] section instead of one). A way to go for you
might be to dump both wikis separately and then import them with
different names, but lucene-search is not designed to run this kind of
setup out of the box.
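
"Two lines in the [Database] section" could look like the sketch below,
assuming the two wikis have been dumped and re-imported under separate
database names. wiki1db and wiki2db are hypothetical names, the options are
copied from the [Database] lines posted elsewhere in this thread, the
per-wiki (language,...) values are only a guess at what each wiki would
need, and lines starting with "#" are assumed to be comments.

```ini
[Database]
# One line per wiki database; both names are placeholders.
wiki1db : (single) (spell,4,2) (language,en)
wiki2db : (single) (spell,4,2) (language,ja)
```

With wildcard entries for the host in [Search-Group] and [Index], a single
daemon would then presumably index and serve both wikis.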
r.
Ross Xu wrote:
> Hi there,
> I don't think it's related, but this wiki site is the second one on the same machine.
>
> The two wiki sites share the same MySQL database with different prefixes, and share the same MediaWiki source using symbolic links.
> The first one (called wiki1) is working well with Lucene-Search 2.1/MWSearch and searches Japanese without any problem. I even use (language,en) instead of (language,ja) in the global conf file.
> The second one (called wiki2) is using Lucene-Search 2.1/MWSearch on different ports (Search.port=8124 and Index.port=8322). It works well when searching for any English keywords, but when searching for Japanese characters, each Japanese character is repeated in the snippets. And I have to use (language,ja); otherwise, I can't get any Japanese search results at all.
>
> Even if I kill all lsearchd daemons, redo the index (build) for wiki2, and restart lsearchd for wiki2, it's still the same thing.
> Any idea is appreciated,
> Ross
>
>
> --- On Wed, 4/14/10, Ross Xu<rossxunix(a)yahoo.ca> wrote:
>
>
> From: Ross Xu<rossxunix(a)yahoo.ca>
> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
> To: "MediaWiki announcements and site admin list"<mediawiki-l(a)lists.wikimedia.org>
> Received: Wednesday, April 14, 2010, 11:52 PM
>
>
> Hi Robert,
> Thanks a bunch for your prompt reply.
> I have changed the global conf file and re-run the build, but Japanese characters are still repeated. Here is what my lsearch-global.conf file looks like:
> ------------------
> [Database]
> wikidb: (single) (spell,4,2) (language,ja)
>
> [Search-Group]
> <my hostname> : wikidb wikidb.spell
>
> [Index]
> <my hostname> : *
>
> [Index-Path]
> <default> : /search
> ...
> ------------------
>
> I know I can't search for a single character, but now I can ONLY search for exactly 2 characters. As mentioned earlier, because of the repetition, searches for more than 2 characters (e.g. 3 characters) don't match anything.
> Any more ideas?
> Thanks again,
> Ross
>
>
> --- On Wed, 4/14/10, Robert Stojnic<rainmansr(a)gmail.com> wrote:
>
>
> From: Robert Stojnic<rainmansr(a)gmail.com>
> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
> To: "MediaWiki announcements and site admin list"<mediawiki-l(a)lists.wikimedia.org>
> Received: Wednesday, April 14, 2010, 11:13 PM
>
>
>
> The repetition of characters is a known bug with highlighting, so you
> will need to disable it. As for searching, as I said you are stuck with
> only being able to search for two or more characters, not single
> characters because this is how text analysis currently works.
>
> To turn off lucene highlighting, use the following in your global conf file:
>
> [Search-Group]
> <put your host name here>: wikidb wikidb.spell
>
> Cheers, r.
>
> Ross Xu wrote:
>
>
>
>> Thank you, Robert.
>> I found the problem ...
>> Each Japanese character gets repeated in the search results. For example, if I search for "関数", I get this in the result list:
>> ----------------------------------------------
>> Showing below results 1 - 20 of 111
>> View (previous 20) (next 20) (20 | 50 | 100 | 250 | 500)
>>
>>
>>
>> Checkers:BSTR.FUNC.LEN/ja
>> SysStringLen 関数数ままたたは SysStringByteLen 関数数をを使使用用しして、BSTR
以外外のの文文字字列列のの長長ささをを取取得得ししよよううととししてていいまます。 これれららのの関関数数のの唯唯一一のの引引数数は、BSTR、CComBSTR、 …
>> 669 B (160 words) - 13:42, 30 September 2009
>>
>>
>> Checkers:VOIDRET/ja
>> void 型のの関関数数ととししてて宣宣言言さされれたた関関数数がが値値をを返返ししてていいまます。 脆弱弱性性ととリリススク :
スタタイイルル関関連連のの問問題題でです。 例 : void A:foo()5 6 return 0; // 値が void 関数数かからら返返さされれる 7 …
>> 444 B (91 words) - 20:17, 1 February 2010
>> ...
>> ----------------------------------------------
>>
>> It doesn't repeat English letters or words.
>> So, if I only search for 2 Japanese characters, I can get them in the results because the 2 characters can still be matched against the repeated characters.
>> If I search for more than 2 characters, it can't find anything.
>>
>> What causes this problem, and how can I fix it?
>> Thanks again,
>> Ross
>>
>> --- On Thu, 4/8/10, Robert Stojnic<rainmansr(a)gmail.com> wrote:
>>
>>
>> From: Robert Stojnic<rainmansr(a)gmail.com>
>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>> To: "MediaWiki announcements and site admin list"<mediawiki-l(a)lists.wikimedia.org>
>> Received: Thursday, April 8, 2010, 2:15 AM
>>
>>
>>
>> The config file looks good. Further debugging steps:
>>
>> 1) make sure the search you are making is showing up in the
>> lucene-search log; if you just start the daemon with ./lsearchd, the
>> log is written to the console
>> 2) make sure you have an MWSearch version that matches your MediaWiki version
>> 3) note that you cannot search for a single Japanese character, but only
>> for 2 or more
>>
>> r.
>>
>> Ross Xu wrote:
>>
>>
>>
>>
>>> Thanks for your reply, Robert.
>>> But I did rebuild the index (./build) after changing (language,en) to (language,ja). It still doesn't work. There is no problem with searching any English keywords.
>>>
>>> You mentioned "add (language,ja)". Did you mean to ADD (language,ja) besides (language,en)?
>>> The entry is like this in my lsearch-global.conf file:
>>> [Database]
>>> wikidb : (single) (spell,4,2) (language,ja)
>>>
>>> Any more ideas?
>>> Thanks again,
>>> Ross
>>>
>>> --- On Wed, 4/7/10, Robert Stojnic<rainmansr(a)gmail.com> wrote:
>>>
>>>
>>> From: Robert Stojnic<rainmansr(a)gmail.com>
>>> Subject: Re: [Mediawiki-l] Does Lucene-Search Support Japanese?
>>> To: "MediaWiki announcements and site admin list"<mediawiki-l(a)lists.wikimedia.org>
>>> Received: Wednesday, April 7, 2010, 6:50 PM
>>>
>>>
>>>
>>> There is rather limited support, but it does work (e.g. see
>>> ja.wikipedia.org). Don't forget to rebuild your index (./build) after
>>> you add (language,ja).
>>>
>>> r.
>>>
>>> Ross Xu wrote:
>>>
>>>
>>>
>>>
>>>
>>>> Hi there,
>>>> I am using Lucene-Search 2.1/MWSearch for my MediaWiki 1.15.1.
>>>> It's working fine, but it can't search any Japanese characters.
>>>> I have tried (language,ja) in the lsearch-global.conf file, but it doesn't seem to make any difference.
>>>> Any idea would be appreciated,
>>>> Ross
>>>>
>>>>
>>>>
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l(a)lists.wikimedia.org