On Wed, 5 Mar 2003, Andre Engels wrote:
One of the users of WikipediaNL found that when one searches for the word
'het' on WikipediaNL, only hits in the article title are found, none in the
article text. When doing a similar search on the English version, it behaves
normally.
Apparently common words are being excluded from full-text searches.
My questions:
* What are the criteria to exclude such words from searches?
* words shorter than two characters (this is configurable; we have it set
to two, so single letters will not be searchable)
* words in the stopword list, which is built into MySQL and
English-specific
* words present in more than half of the search set
'Het' is likely present in more than half of pages on the nl wiki; the
fulltext search considers such extremely common words to be essentially
useless for narrowing down a search and so ignores them. Meanwhile on the
title index, 'het' is probably much less often present, so there it will
show up.
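The three exclusion rules can be sketched roughly like this (a simplified
Python model of MySQL's behavior, not its actual code; the word list and
page set are made up for illustration):

```python
# A word is dropped from the fulltext search when it is shorter than
# the configured minimum length, is on the stopword list, or appears
# in more than half of the indexed rows.

MIN_WORD_LEN = 2            # our configured value
STOPWORDS = {"the", "and"}  # stand-in for MySQL's built-in English list

def is_indexed(word, rows):
    if len(word) < MIN_WORD_LEN:
        return False
    if word in STOPWORDS:
        return False
    # count how many rows contain the word
    containing = sum(1 for row in rows if word in row.split())
    return containing <= len(rows) / 2

pages = ["het huis", "het dorp", "het water", "de molen"]
print(is_indexed("het", pages))    # 'het' is in 3 of 4 pages -> ignored
print(is_indexed("molen", pages))  # rare enough -> indexed
```

Under this model 'het' survives the length and stopword checks but falls
to the frequency rule on a Dutch-heavy page set, matching the behavior
the user saw.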
Unfortunately, our pre-parsing doesn't know which words will be caught this
way, so a multi-word search including "het" against the page-text index will
return no results, because we AND each word's result sets together.
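The ANDing failure mode can be shown in a few lines (a sketch, with
hypothetical page IDs; the ignored word simply contributes an empty set,
which empties the intersection):

```python
# Each query word maps to the set of page IDs that matched it.
# An ignored word has no matches, and intersecting with the
# empty set wipes out every other word's hits.

def search(words, hits_per_word):
    result = None
    for w in words:
        hits = hits_per_word.get(w, set())  # ignored word -> empty set
        result = hits if result is None else result & hits
    return result if result is not None else set()

hits = {"molen": {1, 5, 9}}  # 'het' was ignored, so it has no entry
print(search(["molen"], hits))         # {1, 5, 9}
print(search(["het", "molen"], hits))  # set() -- no results at all
```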
* Is it possible to get for a language a list of such
excluded words?
The English stopword list is duplicated so the preparsing can filter them
out: 'FulltextStoplist.php' in the wiki source tree.
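In outline, that pre-parse filter does something like the following (a
Python sketch of the idea; the real code is PHP, and the stoplist shown
here is an illustrative subset, not the actual contents of
FulltextStoplist.php):

```python
# Strip known stopwords from the query before it reaches MySQL,
# so a guaranteed-to-be-ignored word can't empty the AND-ed result.

STOPLIST = {"the", "a", "and", "of"}  # illustrative subset only

def preparse(query):
    return [w for w in query.lower().split() if w not in STOPLIST]

print(preparse("the history of windmills"))  # ['history', 'windmills']
```

The catch, as above, is that this list only covers the English stopwords;
it can't predict which words a non-English wiki's index will drop under
the frequency rule.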
I'm not sure whether it's possible to extract, from the fulltext index
itself, the list of words that will be ignored for a given table.
-- brion vibber (brion @ pobox.com)