Hello,
I wrote a MediaWiki extension to search wiki content using Hyper Estraier, a full-text search engine.
A description is posted at http://meta.wikimedia.org/wiki/Hyper_Estraier_extension
About Hyper Estraier:
* Sourceforge: http://hyperestraier.sourceforge.net/
* OS Reviews: http://www.osreviews.net/reviews/misc/hyperestraier
I also set up a demo site that uses this extension to search Japanese Wikipedia content: http://ja.wikipedia.tietew.jp/wiki/Special:Search
The look and feel is derived from LuceneSearch.php. The search results and content summaries are generated by the Hyper Estraier search server.
== Background ==
Wikipedia's current lucene-search performs very poorly when searching Japanese. I think Hyper Estraier is better for languages whose words are not separated by spaces, such as Japanese and Chinese.
see also: http://meta.wikimedia.org/wiki/Summer_of_Code_2006#I18n_search_index
## I have no idea which engine is better for European languages.
== Request ==
I want to test and demo real-time index updates from live Wikipedia. Could you please give me permission to access OAIRepository?
== Comparison ==
word: "ウィキペディア" (Wikipedia)
Hyper Estraier: 490 hits
http://ja.wikipedia.tietew.jp/w/index.php?title=%E7%89%B9%E5%88%A5:Search&am...
Lucene: 397 hits ... but only 40 documents are shown
http://ja.wikipedia.org/w/index.php?title=%E7%89%B9%E5%88%A5:Search&sear...

word: "百科事典" (hyakka-jiten; encyclopedia)
Hyper Estraier: 352 hits
http://ja.wikipedia.tietew.jp/w/index.php?title=%E7%89%B9%E5%88%A5:Search&am...
Lucene: 171 hits ... but only 23 documents are shown
http://ja.wikipedia.org/wiki/%E7%89%B9%E5%88%A5:Search?search=%E7%99%BE%E7%A...
Thank you.
Tietew wrote: [snip]
Look'n'feel is derived from LuceneSearch.php.
I'd recommend instead using the search plugin system built into 1.5 and later; see extensions/MWSearch for the Lucene interface for that. It might need a little more tweaking, but will be smoother to replace things with in future than the old hacked-up LuceneSearch.php.
Next week I'll be testing out Sphinx (http://sphinxsearch.com/), a GPL'd fulltext search engine which, at least according to its authors, is faster than Lucene -- much faster at indexing! However, they don't currently have appropriate tokenizing for CJK. We could see about adding that, but I'd certainly love to compare it to something else like Estraier if it's available.
== Request ==
I want to test and demo real-time index updates from live Wikipedia. Could you please give me permission to access OAIRepository?
I'll set this up for you this weekend. Make sure you've got a fast connection and plenty of space. :)
-- brion vibber (brion @ pobox.com)
On Fri, 28 Apr 2006 02:29:13 -0700 In article 4451E069.6000501@pobox.com [Re: [Wikitech-l] Hyper Estraier extension] Brion Vibber brion@pobox.com wrote:
Tietew wrote: [snip]
Look'n'feel is derived from LuceneSearch.php.
I'd recommend instead using the search plugin system built into 1.5 and later; see extensions/MWSearch for the Lucene interface for that. It might need a little more tweaking, but will be smoother to replace things with in future than the old hacked-up LuceneSearch.php.
I could not customize summary generation with MWSearch. Hyper Estraier can generate good summaries by itself.
The SearchEngine class (or SpecialSearch?) should have a summary generator hook or override.
Next week I'll be testing out Sphinx (http://sphinxsearch.com/), a GPL'd fulltext search engine which, at least according to its authors, is faster than Lucene -- much faster at indexing! However, they don't currently have appropriate tokenizing for CJK. We could see about adding that, but I'd certainly love to compare it to something else like Estraier if it's available.
Hyper Estraier uses the N-gram method for CJK languages. Additionally, it can use the "MeCab" morphological analyzer (a dictionary-based Japanese tokenizer; http://mecab.sourceforge.jp/) for keyword search.
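To illustrate why an N-gram approach helps with text that has no spaces between words, here is a minimal Python sketch. It is not Hyper Estraier's implementation: the function names (char_ngrams, build_index, search), the choice of bigrams, and the sample documents are all assumptions made for illustration.

```python
# Illustrative sketch only; not Hyper Estraier's actual code.
# Assumption: character bigrams (n=2) and a plain in-memory inverted index.

def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return []
    if len(chars) < n:
        return ["".join(chars)]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs: dict[str, str], n: int = 2) -> dict[str, set[str]]:
    """Map each n-gram to the set of document ids that contain it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str, n: int = 2) -> set[str]:
    """Return ids of documents containing every n-gram of the query.

    This is a simplification: it does not check that the n-grams are
    adjacent in the document, so it can return false positives.
    """
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


if __name__ == "__main__":
    # Hypothetical sample documents.
    docs = {
        "a": "ウィキペディアは百科事典です",
        "b": "これは辞書です",
    }
    # Whitespace splitting treats the whole sentence as a single token.
    print(docs["a"].split())
    index = build_index(docs)
    print(char_ngrams("百科事典"))
    print(search(index, "百科事典"))
```

Splitting on whitespace yields the whole Japanese sentence as one token, so a word inside it cannot be matched, while the bigram index finds it without needing a dictionary. A dictionary-based tokenizer such as MeCab takes the other route and splits the text into real words.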
Mr. Hirabayashi, the author of Hyper Estraier, says he will support and cooperate with me and Wikipedia.
Hyper Estraier has its own HTTP-based P2P protocol, so we can construct a distributed search cluster very easily.
For example, my demo site is composed of two machines: an Apache server with MediaWiki on Linux, and a Hyper Estraier search server (node master) on Windows XP.
== Request ==
I want to test and demo real-time index updates from live Wikipedia. Could you please give me permission to access OAIRepository?
I'll set this up for you this weekend. Make sure you've got a fast connection and plenty of space. :)
Thank you!
wikitech-l@lists.wikimedia.org