For CJK we currently use a very simple tokenizer; the code is here [1]. Apparently this is the standard way of handling CJK, and a similar tokenizer is included in the Lucene sandbox. If you could provide some insight into a better CJK tokenizer, we would be glad to listen :)
r.
[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search-2/src/org/wiki...
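To make "very simple tokenizer" concrete, here is a rough sketch of the overlapping-bigram approach that the Lucene sandbox CJK tokenizer takes. It is not the code behind [1]: the class name CjkBigramSketch, the helper isCjk, and the choice of scripts treated as CJK are assumptions made for illustration, and it uses only the Java standard library rather than the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a standalone bigram tokenizer, not the lucene-search-2 code.
public class CjkBigramSketch {

    // Assumption: treat Han, Hiragana, Katakana and Hangul code points as "CJK".
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Runs of CJK characters become overlapping two-character tokens;
    // runs of other letters and digits become one lowercased token each.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (isCjk(cps[i])) {
                int start = i;
                while (i < cps.length && isCjk(cps[i])) {
                    i++;
                }
                if (i - start == 1) {
                    // A lone CJK character is kept as a single-character token.
                    tokens.add(new String(cps, start, 1));
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(new String(cps, j, 2));
                    }
                }
            } else if (Character.isLetterOrDigit(cps[i])) {
                int start = i;
                while (i < cps.length
                        && Character.isLetterOrDigit(cps[i])
                        && !isCjk(cps[i])) {
                    i++;
                }
                tokens.add(new String(cps, start, i - start).toLowerCase(Locale.ROOT));
            } else {
                // Whitespace and punctuation only separate tokens.
                i++;
            }
        }
        return tokens;
    }
}
```

With this sketch, a run of four Han characters ABCD is indexed as AB, BC, CD, so a query for BC matches without any dictionary-based word segmentation. The price is a larger index and some false matches across real word boundaries, which is where a better tokenizer would help.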
On 9/11/07, howard chen howachen@gmail.com wrote:
Hello,
One of my interests is how Wikipedia's Lucene implementation can be used to tackle foreign languages such as Chinese or Japanese, where tokenization is more complex.
Besides, will the search engine be open sourced later?
:)
On 9/11/07, Tim Starling tstarling@wikimedia.org wrote:
howard chen wrote:
Is that the same as mediawiki, i.e. MySQL search?
Any plan to use lucene (Zend Framework) in the future?
We use Lucene Java. No, there are no plans to use the Lucene in Zend Framework; I didn't know it existed until now. Zend doesn't sound to me like the best framework for building multithreaded apps.
-- Tim Starling
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l