On Sun, Mar 10, 2002 at 09:28:46AM -0800, Jimmy Wales wrote:
Brion L. VIBBER wrote:
Rather, if we're going to eliminate "useless" search terms, we should
have a (per-language) list of such words.
A useful and simple (though not perfect) measure of uselessness is how
many pages are returned for a given word. In English, 'a', 'an' and
'the' will appear in nearly every article. In Japanese, 'wa' and
other similar marker words will appear in nearly every article.
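
A rough sketch of that measure, for illustration only: count how many pages contain each word and flag the words that show up in almost all of them. The function names, the sample pages and the 0.8 cutoff are all made up here, and splitting on whitespace is an assumption that only suits languages written with spaces.

```python
from collections import Counter

def document_frequency(pages):
    """Count, for each word, how many pages contain it at least once."""
    df = Counter()
    for text in pages:
        # set() so a word is counted once per page, however often it repeats
        df.update(set(text.lower().split()))
    return df

def useless_words(pages, threshold=0.8):
    """Return words found in more than `threshold` of all pages.

    The 0.8 default is an arbitrary illustrative cutoff, not a tuned value.
    """
    df = document_frequency(pages)
    return {word for word, n in df.items() if n / len(pages) > threshold}

# Hypothetical sample pages
pages = [
    "the cat sat on a mat",
    "the dog ate the bone",
    "a history of the foo",
]
print(sorted(useless_words(pages)))  # prints ['the']
```

With these three pages only "the" clears the cutoff; "a" is in two of three pages and stays searchable.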
I'm wondering how search is going to work in Japanese.
Not only are some articles in romaji and others in kanji,
but kanji words usually aren't separated by whitespace,
so it might be a bit difficult.
The more articles that are returned for a given search term, the less
informative it is.
Only if they are not sorted.
We could just give less priority to more frequent words and more priority
to less frequent ones, or something like that. So, for example, a search for
"the foo" would score +1 point for every "the" and +10 for every "foo" (if
"foo" is 10x less frequent than "the"). Then sort the results by this score.
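
Here is a minimal sketch of that scoring idea, assuming whitespace-separated words and a tiny in-memory set of pages. The names term_weights and rank, and the sample pages, are hypothetical; the weight is simply the count of the most frequent query word divided by the count of each word, which is one way to get "+1 for the, +10 for foo", not the only one.

```python
from collections import Counter

def term_weights(query_words, corpus_counts):
    """Weight each query word by its rarity relative to the most frequent
    query word: the commonest gets 1, a word 10x rarer gets 10."""
    present = [w for w in query_words if corpus_counts[w] > 0]
    if not present:
        return {}
    top = max(corpus_counts[w] for w in present)
    return {w: top / corpus_counts[w] for w in present}

def rank(query, pages):
    """Return (score, title) pairs, highest score first."""
    counts = {title: Counter(text.lower().split())
              for title, text in pages.items()}
    corpus_counts = Counter()
    for page_counts in counts.values():
        corpus_counts.update(page_counts)
    weights = term_weights(query.lower().split(), corpus_counts)
    scored = []
    for title, page_counts in counts.items():
        # Each occurrence of a word adds that word's weight to the score
        score = sum(weight * page_counts[w] for w, weight in weights.items())
        if score > 0:
            scored.append((score, title))
    return sorted(scored, reverse=True)

# Hypothetical pages: "the" occurs 10 times overall, "foo" once
pages = {
    "A": "the the the the the foo",
    "B": "the the the the the bar",
}
print(rank("the foo", pages))  # prints [(15.0, 'A'), (5.0, 'B')]
```

Both pages match "the" equally, but the single "foo" on page A is worth ten times as much, so A sorts first.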