On Sun, Mar 10, 2002 at 09:28:46AM -0800, Jimmy Wales wrote:
Brion L. VIBBER wrote:
Rather, if we're going to eliminate "useless" search terms, we should have a (per-language) list of such words.
A useful and simple (though not perfect) measure of uselessness is how many pages are returned for a given word. In English, 'a', 'an' and 'the' will appear in nearly every article. In Japanese, 'wa' and other similar marker words will appear in nearly every article.
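A rough sketch of that measure, counting how many articles each word appears in and flagging the words that show up in nearly all of them. The function name `stop_word_candidates`, the 0.9 threshold, and the whitespace tokenizer are illustrative assumptions, not anything the existing search code is known to do:

```python
from collections import Counter

def stop_word_candidates(articles, threshold=0.9):
    """Return words that occur in at least `threshold` of the articles.

    `articles` is a list of article texts. Splitting on whitespace is an
    assumption that only suits languages that separate words with spaces.
    """
    doc_freq = Counter()
    for text in articles:
        # Count each word once per article (document frequency).
        doc_freq.update(set(text.lower().split()))
    total = len(articles)
    return sorted(w for w, df in doc_freq.items() if df / total >= threshold)

articles = ["the cat sat", "the dog ran", "a bird and the fish"]
candidates = stop_word_candidates(articles)
```

Running this per language would give a starting point for a per-language list, which someone could then review by hand.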
I'm wondering how search is going to work in Japanese. Not only are some articles in romaji and others in kanji, but kanji words usually aren't separated by whitespace, so it might be a bit difficult.
The more articles that are returned for a given search term, the less informative it is.
Only if they are not sorted. We could just give less priority to more frequent words and more priority to less frequent ones. So, for example, a search for "the foo" would score +1 point for every "the" and +10 for every "foo" (which is 10x less frequent than "the"). The results would then be sorted by this score.
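A minimal sketch of that scoring idea, assuming each query word is weighted by how much rarer it is than the most frequent query word across all articles. The `rank` function, the whitespace tokenizer, and the sample articles are made up for illustration:

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()

def rank(query, articles):
    """Rank articles (dict of title -> text) for a query.

    Each occurrence of a query word adds that word's weight to the
    article's score; rarer words get proportionally larger weights.
    """
    per_article = {title: Counter(tokenize(text)) for title, text in articles.items()}
    totals = Counter()
    for counts in per_article.values():
        totals.update(counts)

    # Keep each query word once, and only if it occurs somewhere.
    terms = [t for t in dict.fromkeys(tokenize(query)) if totals[t] > 0]
    if not terms:
        return []

    # The most frequent query word gets weight 1; a word that is
    # 10x less frequent gets weight 10.
    most_common = max(totals[t] for t in terms)
    weights = {t: most_common / totals[t] for t in terms}

    scores = {}
    for title, counts in per_article.items():
        score = sum(counts[t] * weights[t] for t in terms)
        if score > 0:
            scores[title] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

articles = {
    "A": "the the the the the foo",
    "B": "the the the the the",
    "C": "bar",
}
results = rank("the foo", articles)
```

Here "the" occurs 10 times overall and "foo" once, so "foo" is weighted 10 and article A (score 15) sorts ahead of article B (score 5).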