One of the users of WikipediaNL found that when one searches for the word 'het' on WikipediaNL, only hits in the article title are found, and none in the article text. When doing a similar search on the English version, it behaves normally. Apparently common words are being excluded from full-text searches.
My questions:
* What are the criteria to exclude such words from searches?
* Is it possible to get a list of such excluded words for a given language?
Andre Engels
On Wed, 5 Mar 2003, Andre Engels wrote:
One of the users of WikipediaNL found that when one searches for the word 'het' on WikipediaNL, only hits in the article title are found, and none in the article text. When doing a similar search on the English version, it behaves normally. Apparently common words are being excluded from full-text searches.
My questions:
- What are the criteria to exclude such words from searches?
* words shorter than two characters (this is configurable; we have it set to two, so single letters will not be searchable)
* words in the stopword list, which is built into MySQL and English-specific
* words present in more than half of the search set
'Het' is likely present in more than half of the pages on the nl wiki; the fulltext search considers such extremely common words essentially useless for narrowing down a search and so ignores them. Meanwhile, on the title index 'het' is probably present much less often, so there it will show up.
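Roughly, as an illustration only (the table and column names 'cur', 'cur_text' and 'cur_title', the credentials, and the assumption of fulltext indexes on both columns are made up for the example, not the real wiki schema), the difference between the two indexes looks like this -- MySQL's natural-language MATCH silently drops a word that appears in more than half the rows of the index being searched:

<?php
// Illustration only: names and credentials are assumptions.
$db = mysql_connect( 'localhost', 'wikiuser', 'secret' );
mysql_select_db( 'wikidb', $db );

// On the article-text index 'het' is in more than half the rows,
// so the natural-language MATCH drops it and nothing matches.
$text = mysql_query(
    "SELECT COUNT(*) FROM cur WHERE MATCH(cur_text) AGAINST('het')", $db );

// On the title index 'het' is far less common, so it still matches.
$title = mysql_query(
    "SELECT COUNT(*) FROM cur WHERE MATCH(cur_title) AGAINST('het')", $db );

echo mysql_result( $text, 0 ), " text hits, ",
     mysql_result( $title, 0 ), " title hits\n";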
Unfortunately, our pre-parsing doesn't know which words will be caught this way, so a multi-word search including "het" on the page index will return no results, because we 'and' each word's result sets together.
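As a sketch (this is not the actual wiki code; the stoplist contents and function name are invented for the example), the pre-parse step can strip short words and stoplisted words, but it has no way of knowing which of the remaining words MySQL will drop for being too common, so those stay in the AND-ed query and empty out the results:

<?php
// Sketch of the pre-parse filtering; names are invented.
$exampleStoplist = array( 'the', 'and', 'for', 'that' );

function preparseSearchTerms( $query, $stoplist, $minLength = 2 ) {
    $kept = array();
    foreach ( preg_split( '/\s+/', strtolower( trim( $query ) ) ) as $word ) {
        if ( strlen( $word ) < $minLength ) {
            continue;   // shorter than the configured minimum
        }
        if ( in_array( $word, $stoplist ) ) {
            continue;   // in the (English) stopword list
        }
        // Words present in >50% of rows are dropped by MySQL itself;
        // we can't see that here, so 'het' survives this step and the
        // AND of all the term result sets comes back empty.
        $kept[] = $word;
    }
    return $kept;
}

print_r( preparseSearchTerms( 'the het huis', $exampleStoplist ) );
// Array ( [0] => het [1] => huis )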
- Is it possible to get a list of such excluded words for a given language?
The English stopword list is duplicated so the pre-parsing can filter those words out: see 'FulltextStoplist.php' in the wiki source tree.
I'm not sure if it's possible to extract a list of words-that-will-be-ignored on a given table from the fulltext index.
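One crude workaround might be to measure a word's document frequency yourself and compare it to half the row count. Just a sketch -- the 'cur' table name is an assumption, and the padded LIKE only roughly approximates whole-word matching:

<?php
// Rough approximation of the ">50% of rows" criterion; illustration only.
$db = mysql_connect( 'localhost', 'wikiuser', 'secret' );
mysql_select_db( 'wikidb', $db );

$total = mysql_result( mysql_query( "SELECT COUNT(*) FROM cur", $db ), 0 );
$word  = mysql_escape_string( 'het' );
$freq  = mysql_result( mysql_query(
    "SELECT COUNT(*) FROM cur WHERE cur_text LIKE '% $word %'", $db ), 0 );

if ( $freq > $total / 2 ) {
    echo "'het' is probably being ignored by the fulltext index\n";
}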
-- brion vibber (brion @ pobox.com)