Brion Vibber wrote:
Mohamed Magdy wrote:
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category. Nice work! But what does this mean? "and stemmed words are penalized."
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed word). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong, but doesn't the program have to know the root of a word before it can extract it? I mean, it would need a big list of words and their roots, and not only for English; you would need lists for each language. Otherwise, how will the program strip the words?
Roughly speaking, stemming is the process of taking inflected forms of words ("category", "categories") and extracting a normalized root form (say, "categori") for comparison purposes. This allows you to search on one form and receive results containing the other.
Thanks for explaining!
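As a rough illustration of the idea above, here is a minimal Python sketch of rule-based suffix stripping. It shows that a stemmer can work from a handful of suffix rules rather than a word list. The function name toy_stem and its rules are made up for this example; this is not the Porter algorithm and not what the Lucene filters do.

```python
def toy_stem(word: str) -> str:
    """Very rough suffix stripping, for illustration only.

    Hypothetical rules chosen so that "category" and "categories"
    collapse to the same root; real stemmers use far more rules.
    """
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "i"      # categories -> categori
    if w.endswith("y") and len(w) > 2:
        return w[:-1] + "i"      # category -> categori
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]            # cats -> cat
    return w


# Both inflected forms map to the same root, so they compare equal.
same_root = toy_stem("category") == toy_stem("categories")
```

Because the rules depend on how a language forms its suffixes, each language needs its own rule set, but no dictionary of every word is required for this kind of approach.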
The exact code to do this will vary depending on language. A number of preexisting stemming filters exist for Lucene's indexing engine, some of which are used here.
Our currently-live search does basic stemming for English, German, Russian, and Esperanto, but not for other languages.
That raises the question: when and how will other languages get stemming as well?
Sorry if this is a bit annoying, but what is the difference between basic and advanced stemming? Or will it be added?
The issue Robert mentioned was that the old way would return results with any inflected form unconditionally, which can be annoying when you really did want an exact match. The new code has a preference for exact matches in the ranking, but will return related inflected forms as well.
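The ranking behaviour described above can be sketched in Python as follows. This is a toy model, not the actual search code: the index layout, the names build_index and search, and the weights EXACT_WEIGHT and STEM_WEIGHT are all assumptions made for illustration. Each token is indexed under both its original form and its stem, and a document that contains the exact query term scores higher than one that only matches via the stem.

```python
from collections import defaultdict

# Assumed weights, chosen only to make exact matches rank first.
EXACT_WEIGHT = 2.0
STEM_WEIGHT = 1.0


def toy_stem(word):
    """Hypothetical two-rule stemmer; stands in for a real stemming filter."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "i"
    if w.endswith("y") and len(w) > 2:
        return w[:-1] + "i"
    return w


def build_index(docs):
    """Index every token twice: once as written, once stemmed."""
    exact = defaultdict(set)
    stemmed = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            exact[token].add(doc_id)
            stemmed[toy_stem(token)].add(doc_id)
    return exact, stemmed


def search(term, exact, stemmed):
    """Return (doc_id, score) pairs, highest score first."""
    term = term.lower()
    scores = defaultdict(float)
    # Any inflected form sharing the stem counts as a match...
    for doc_id in stemmed.get(toy_stem(term), ()):
        scores[doc_id] += STEM_WEIGHT
    # ...but the exact form earns an additional boost.
    for doc_id in exact.get(term, ()):
        scores[doc_id] += EXACT_WEIGHT
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {"a": "list of categories", "b": "one category page"}
exact, stemmed = build_index(docs)
results = search("category", exact, stemmed)
```

With these two sample documents, a search for "category" returns both, with document "b" (the exact match) ranked ahead of document "a" (matched only through the stem).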
-- brion vibber (brion @ wikimedia.org)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l