Robert Stojnic wrote:
Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here:
Most changes are in the internals (e.g. making searching/indexing distributed, incremental updates...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages. Highlights:
- prefix searches. E.g. entering help:images in the search box will search only the Help namespace.
- category search. You can limit a search by category, e.g. clarinet incategory:"woodwind instruments"
- improved scoring. Default Lucene scoring favors short articles, so I tried to make scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search box. The top two articles have almost the same score: the first, Commodity (Marxism), is a long article about the usage of the word in Marxism; the second, Commodity, is a much shorter article whose title fits the query more accurately.
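
To make the two new syntax forms above concrete, here is a minimal sketch of how such a query could be split into a namespace prefix, category filters, and plain terms. This is not the engine's actual parser: the function name parse_query, the KNOWN_NAMESPACES set, and the returned dictionary layout are all assumptions made up for illustration.

```python
import re

# Hypothetical namespace list, for illustration only; the real engine
# presumably takes its namespaces from the wiki configuration.
KNOWN_NAMESPACES = {"help", "talk", "user", "category"}

# Matches incategory:"quoted name" or incategory:single_word
INCATEGORY_RE = re.compile(r'incategory:(?:"([^"]+)"|(\S+))')


def parse_query(query):
    """Split a raw query into namespace, category filters and search terms."""
    categories = []

    def _collect(match):
        # Group 1 is the quoted form, group 2 the unquoted form.
        categories.append(match.group(1) or match.group(2))
        return " "

    # Pull out every incategory: filter first, leaving the rest of the query.
    rest = INCATEGORY_RE.sub(_collect, query)

    # Treat "name:" at the start as a namespace prefix only if it is known.
    namespace = None
    head, sep, tail = rest.partition(":")
    if sep and head.strip().lower() in KNOWN_NAMESPACES:
        namespace = head.strip().lower()
        rest = tail

    return {"namespace": namespace, "categories": categories, "terms": rest.split()}
```

With this sketch, parse_query('help:images') yields the namespace "help" with the single term "images", and parse_query('clarinet incategory:"woodwind instruments"') yields the term "clarinet" restricted to the category "woodwind instruments".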
The test index is based on the latest dumps for the 15 largest wikis, with updates from the last 4-5 days.
Any feedback will be appreciated :)
Robert
http://en.wikipedia.org/wiki/Commodity_%28Marxism%29
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Limiting the search to a certain category sounds nice. Nice work! But what does this mean: "and stemmed words are penalized."?