By parsing the search string, we get the benefit of full boolean search, which is definitely cool. But our current regime has one severe downside: any search that contains a single short word, stopword, or non-existent word will automatically fail, since we assume an implicit "AND" between all search terms. The unadorned MySQL "MATCH" operator does not have this downside: if you search for "the chinese wall", for example, "the" will be silently ignored and you get the expected hit.
I am wondering if we can combine the best of both worlds. This should dramatically decrease the number of complaints about short search terms, maybe even to the point that we can keep the current index size.
How about this: every subsequence of the query string which doesn't contain any +/- boolean operators (see below) is passed to the MATCH operator as is, which assumes an implicit "give me the best matches you can find for these terms, the more matching terms the better". Then we could have two additional operators: + and -. If a word is preceded by +, it *must* be present; if a word is preceded by -, it *cannot* be present. That allows us to express any complicated query we can right now, but should result in far fewer failed searches. Does that seem feasible?
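
To make the proposal concrete, here is a rough sketch in Python of how the query string could be split up. It is only an illustration, not the existing search code: the function name parse_query and the rule that a lone "+" or "-" counts as an ordinary word are my assumptions. It only does the splitting; how the three parts are then turned into SQL is left open.

```python
def parse_query(query):
    """Split a search string into loose runs, required words and excluded words.

    Hypothetical helper for illustration only. Consecutive words without a
    +/- prefix are kept together as one "loose run", to be handed to MATCH
    unchanged. A bare "+" or "-" is treated as an ordinary word (assumption).
    """
    loose_runs, required, excluded = [], [], []
    run = []  # words of the loose run currently being collected
    for token in query.split():
        if len(token) > 1 and token[0] in "+-":
            # An operator ends the current loose run, if there is one.
            if run:
                loose_runs.append(" ".join(run))
                run = []
            target = required if token[0] == "+" else excluded
            target.append(token[1:])
        else:
            run.append(token)
    if run:
        loose_runs.append(" ".join(run))
    return loose_runs, required, excluded


if __name__ == "__main__":
    print(parse_query("the chinese wall +great -berlin ancient history"))
```

With this split, a query like "the chinese wall" stays a single loose run, so a stopword such as "the" never becomes a hard requirement on its own.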
Axel
wikitech-l@lists.wikimedia.org