Lars Aronsson wrote:
This is a software problem that should not be exposed to the user. The four-letter limit is stupid and should be lowered to allow three-letter words. If my search query contains whitespace, each word could still be sent separately to the MySQL search engine, and the resulting hit lists joined so that pages containing both words are listed ahead of pages containing only one of them.
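The merge step Lars describes can be sketched in a few lines (Python here purely for illustration; the data is made up, and a real implementation would work on page IDs from the database):

```python
# Sketch: merge per-word hit lists so pages matching every query word
# rank ahead of pages matching only some of them.
def merge_hits(per_word_hits):
    # per_word_hits: one set of matching page titles per query word
    counts = {}
    for hits in per_word_hits:
        for page in hits:
            counts[page] = counts.get(page, 0) + 1
    # Pages matching more of the words sort first; ties alphabetical.
    return sorted(counts, key=lambda p: (-counts[p], p))

results = merge_hits([
    {"Sweden", "Denmark", "Norway"},   # hits for word 1
    {"Sweden", "Norway"},              # hits for word 2
])
# "Norway" and "Sweden" (both words) precede "Denmark" (one word)
```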
I concur.
From my own experience writing the search engine for Bomis and the old FastCGI Wikipedia search engine, it is not particularly costly to have on-the-fly scoring logic suited to the problem at hand. Even very ad hoc measures can be hugely beneficial in having the search results cause joy in the searcher: scoring title matches higher than body matches is very powerful for Wikipedia, for example, and at Bomis I have a measure of "uselessness" for words that does a nice job of helping relevancy over a raw keyword search.
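A minimal sketch of that kind of ad hoc scoring (the weights and the stand-in stop-word list are invented for illustration; Jimbo's actual "uselessness" measure is not described here):

```python
# Hypothetical scorer: weight title hits far above body hits, and
# down-weight very common ("useless") words. All constants illustrative.
COMMON_WORDS = {"the", "of", "and"}  # stand-in for a uselessness measure

def score(page_title, page_body, query_words):
    s = 0.0
    title = page_title.lower().split()
    body = page_body.lower().split()
    for w in query_words:
        weight = 0.2 if w in COMMON_WORDS else 1.0
        s += 10.0 * weight * title.count(w)  # a title hit is worth much more
        s += 1.0 * weight * body.count(w)
    return s

# score("Sweden", "Sweden is a country", ["sweden"]) -> 11.0
```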
I personally wonder about the raw performance of MySQL compared to btree dbm files. In theory, I could write a Perl script that indexes the current Wikipedia database once per night, using some of my "tricks of the trade", and make the search engine a lot better than it is now.
However, I wonder if that's the right approach. The downside of doing it my old-fashioned way is that the search index only updates as often as the cron job rebuilds it. If the data is always "live" in the MySQL database, and if that is just as fast, then that's obviously the better way to do it.
I don't have any actual numbers, I just know that FastCGI Perl + DB_File btree stuff is very fast. (Bomis handles about 100x the traffic on 3x the servers!)
--Jimbo