L.S.,
I have just committed a rewritten version of the search engine to CVS. It is now based on the fulltext indexes offered by MySQL. To use it, the database schema has to be extended a little: the table types have changed to MyISAM (MySQL's own version of ISAM), there is an extra redundant column, and two fulltext indexes have to be added. The commands to do this have been added to "updSchema.sql"; just uncomment the commands you still have to run. The resulting database schema should be as in "wikipedia.sql".
This new search engine is something of a mixed blessing.
First the good news.
+ The main reason for introducing it is that the database reported the search queries as its slowest, so they are eating up a lot of database resources. So not only will it make searches faster, but the whole of Wikipedia will probably benefit from its introduction.
+ The search engine tries to estimate the relevance of pages with respect to the given search words, and it presents them to the user in that order rather than sorted alphabetically, as now.
+ With the arrival of MySQL 4 there will be new search possibilities such as boolean searches and natural language searches.
And now for the bad news.
- I had to introduce an extra redundant column containing a duplicate of cur_title. This is because the fulltext index cannot be defined on binary columns.
- If you give the search engine multiple words, it also regards multiple occurrences of just one of them as very relevant. So the term "Larry Sanger" will also lead you to pages that merely contain "Larry" many times.
- It does not search on small words of three letters or fewer. So the term "war" gives you zero results.
- It searches in the raw HTML, so it doesn't know that G&ouml;del and Gödel are the same.
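
To make the last three points concrete, here is a rough Python sketch. It is a toy model only and not MySQL's actual ranking algorithm: the names index_words, score, and MIN_WORD_LEN are made up for illustration, and the scoring simply counts occurrences of the query words. It assumes that words of three letters or fewer are dropped, as described above.

```python
import html
import re
from collections import Counter

# Assumption for this sketch: words of three letters or fewer are not indexed.
MIN_WORD_LEN = 4


def index_words(text, decode_entities=False):
    """Split text into lowercase words that are long enough to be indexed."""
    if decode_entities:
        # Turns HTML entities such as G&ouml;del into Gödel before splitting.
        text = html.unescape(text)
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if len(w) >= MIN_WORD_LEN]


def score(query, page_text, decode_entities=False):
    """Toy relevance: total number of occurrences of any query word."""
    counts = Counter(index_words(page_text, decode_entities))
    return sum(counts[w] for w in index_words(query, decode_entities))


# A page with many "Larry" outranks a page that has both words once.
print(score("Larry Sanger", "Larry Larry Larry went home"))
print(score("Larry Sanger", "Larry Sanger founded"))

# "war" is too short, so the query contains no searchable word at all.
print(score("war", "The war was long"))

# On raw HTML the entity form does not match; after decoding it does.
print(score("Gödel", "Kurt G&ouml;del"))
print(score("Gödel", "Kurt G&ouml;del", decode_entities=True))
```

In this toy model, decoding the entities before indexing would be one way around the last problem; whether that is practical for the real index is a separate question.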
That's it for now. My next task will be the MostWanted Page.
-- Jan Hidders