CVS contains FulltextStoplist.php, which is a list of the "common words" excluded from search queries. It contains only English words, which has caused complaints on the German Wikipedia, as they, at least, don't want to be kept from searching for "false friends" that happen to be common words in English.
It would be easy to just make it another array/function in the Language files, but:
1. AFAIK, it is only used in one function, namely search.
2. It might be nice if updating this list were easy for everyone, not just developers.
So, why not make it "wikipedia:Fulltext Stoplist", and load the whole list from the database on each query? It might actually save some time in the long run, and the non-English Wikipedias could easily develop their own lists. Or would that be too risky?
Magnus
Magnus Manske wrote:
> The CVS contains FulltextStoplist.php, which is a list of the "common words" excluded from search queries. It contains only English words, which caused complaints on the German wikipedia, as they, at least, don't want to be kept from searching for "false friends" common in English.
My understanding was that the stopwords are implemented in MySQL's indexing and search feature.
We have a list of them so that _our_ search system can filter stopwords out before they hit MySQL's search, where they behave in a way that's extremely unhelpful to us (searching a stopword by itself returns nothing; we search each word individually and then return the intersection of all the subsearches, so a word that returns no hits kills the entire search).
So, changing that list wouldn't help with false friends, as you'd just get empty search results; we'd have to change MySQL. (Which is entirely doable, I'm sure.)
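To make the behaviour described above concrete, here is a rough sketch in Python (not the actual PHP code). The names fulltext_lookup, search, INDEX and STOPLIST are made up for illustration, and the dictionary lookup only stands in for a single-word MySQL full-text query. The one property it is meant to mimic is that a stopword searched by itself returns nothing.

```python
# Hypothetical stand-in for the full-text index: word -> ids of matching pages.
# Stopwords such as "the" are simply absent, as they would be in the real index.
INDEX = {
    "stop": {1, 2},
    "sign": {2, 3},
}

# Hypothetical copy of the stoplist kept on our side of the search.
STOPLIST = {"the", "a", "of"}


def fulltext_lookup(index, word):
    """Simulate one single-word full-text query (assumption, not MySQL itself)."""
    # A word that is not indexed, e.g. a stopword, yields no hits at all.
    return set(index.get(word, set()))


def search(index, stoplist, query):
    """Search each word individually and return the intersection of the hits."""
    # Drop stopwords before they reach the full-text lookup.
    words = [w for w in query.lower().split() if w not in stoplist]
    if not words:
        return set()
    subsearches = [fulltext_lookup(index, w) for w in words]
    # A single empty subsearch empties the whole intersection.
    return set.intersection(*subsearches)


if __name__ == "__main__":
    # With the filter in place, "the" is dropped and the other words decide.
    print(search(INDEX, STOPLIST, "the stop sign"))
    # With an empty stoplist, "the" is searched, finds nothing, kills the result.
    print(search(INDEX, set(), "the stop sign"))
```

The second call shows why just trimming our own list would not be enough: the word then reaches the lookup, comes back empty, and takes the whole query down with it.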
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org