Having said that, I now favour removing my search code and moving to MySQL's boolean search, because if you don't like its default scoring you can now use the +'s. It was fun writing a parser for boolean expressions, but if we can get rid of that complicated piece of code and defer some of the work to the database, I'm all for it. Simplify, simplify.
The first simplification I made was to get rid of the parser because it wasn't necessary--SQL is already doing it, so I just pass on the ANDs, ORs, and NOTs as they are. Yes, I put an implicit AND between terms, because that makes for fast, small result sets.
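To make the "pass the operators through, implicit AND between terms" idea concrete, here is a minimal sketch in Python. It is not the code in the new codebase: the function name `build_where`, the column name `index_text`, and the `%s` placeholder style (as used by common Python MySQL drivers) are all assumptions for illustration, and it does no validation of malformed queries.

```python
import re


def build_where(query, column="index_text"):
    """Turn a user query into a SQL condition plus its bound parameters.

    AND / OR / NOT and parentheses are passed through unchanged; an AND
    is inserted wherever two operands sit next to each other.
    `column` is a hypothetical name for the full-text indexed field.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    parts, params = [], []
    prev_is_operand = False  # True right after a search term or ")"
    for tok in tokens:
        upper = tok.upper()
        if upper in ("AND", "OR"):
            parts.append(upper)
            prev_is_operand = False
        elif upper == "NOT" or tok == "(":
            if prev_is_operand:
                parts.append("AND")  # implicit AND before NOT or "("
            parts.append(upper)
            prev_is_operand = False
        elif tok == ")":
            parts.append(")")
            prev_is_operand = True
        else:
            if prev_is_operand:
                parts.append("AND")  # implicit AND between two terms
            parts.append(f"MATCH({column}) AGAINST (%s)")
            params.append(tok.lower())
            prev_is_operand = True
    return " ".join(parts), params


sql, params = build_where("pvc pipe NOT copper")
print(sql)
print(params)
```

The search words travel as bound parameters rather than being pasted into the SQL string, so only the operators themselves ever reach the query text.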
The boolean searching in MySQL 4.0 would be great--but that's a BIG leap--MySQL 4.0 is not a stable product. It's alpha software, and I'm not so sure that giving up the reliability of 3.23 is worth the extra features. Does anyone on the list have experience with MySQL 4.0 in a production environment?
MySQL 3.23 is very stable and reliable. Even recompiling it from source was simple (I did that to get rid of the 4-letter minimum--you can check that out at the new site--search for "PVC" for example).
The second change I made to the search was to parse the article text into a separate field the way we were already doing for titles. This field contains all the unique words of the article just once, case folded and stripped of punctuation (so it fixes the '' problem, for example). I even do some processing for things like [[game]]s, which will put both "game" and "games" in the index. We could expand this preprocessing to handle other cases as well.
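For readers who want a picture of that preprocessing, here is a rough sketch in Python. It is an illustration only, not what SearchUpdate.php does: the function name `index_words` and the regular expressions are assumptions, and it only handles ASCII letters and digits, which matches the caveat that new character sets are not dealt with yet.

```python
import re


def index_words(wikitext):
    """Build the text for a search-index field from raw article text.

    Hypothetical helper: case-folds, expands [[link]]suffix forms so both
    the bare word and the suffixed word are indexed, drops punctuation,
    and keeps each word only once (in order of first appearance).
    """
    text = wikitext.lower()
    # [[game]]s  ->  "game games"
    text = re.sub(r"\[\[([^\]|]+)\]\]([a-z]+)", r"\1 \1\2", text)
    # Anything that is not a letter or digit (quotes, brackets, ...) is a separator.
    words = re.findall(r"[a-z0-9]+", text)
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return " ".join(dict.fromkeys(words))


print(index_words("''Chess'' is a [[game]]s; Chess!"))
# chess is a game games
```

Because the '' markup is treated as a separator, a word wrapped in it is indexed exactly like the plain word.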
We could also do our own scoring after MySQL returns the raw results, but that would require making a pass through the entire result set before displaying anything. Another thing about the search in the new codebase is that it is blindingly fast--it often returns results within 2 seconds. When it's that fast, you don't need as many features because the user can do multiple searches.
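A small sketch of what scoring after the fact would involve, and why it means reading every row first. Everything here is hypothetical: the function name `rank_results`, the (title, text) row shape, and the weight of 3 for title hits are made up for illustration and are not a proposal for the real scoring.

```python
def rank_results(rows, terms, title_weight=3):
    """Re-rank raw search results by a simple term-count score.

    rows:  iterable of (title, text) tuples as returned by the database
    terms: lowercase search words
    title_weight is an arbitrary illustrative value.
    """
    scored = []
    for title, text in rows:  # the full pass over the result set
        title_words = title.lower().split()
        text_words = text.lower().split()
        score = sum(
            title_weight * title_words.count(t) + text_words.count(t)
            for t in terms
        )
        scored.append((score, title))
    # Nothing can be displayed until every row has been scored and sorted.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


rows = [
    ("Plumbing", "pvc pipe and copper pipe"),
    ("PVC", "a plastic"),
    ("Pipe", "pvc pvc"),
]
print(rank_results(rows, ["pvc"]))
# ['PVC', 'Pipe', 'Plumbing']
```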
However, in the long run we should probably implement our own indexing. That would allow us to tackle several problems:
- the ' problem
- searching UTF-8 with proper collation without hacking the character set
- recognizing entities such as &ouml;
- languages with inflections
All of these can be solved with the pre-processing already in the new codebase--in fact the ' problem is already solved. I haven't done anything with new character sets, but that should be pretty easy--take a look at SearchUpdate.php.
- partial matches
or ... we could wait for the MySQL team to implement the "Generic user-suppliable UDF preparser" mentioned in their to-do list. Perhaps we should give them a call. :-)
I'm really big on stable, reliable software. Even if MySQL chose to implement something like that, I wouldn't recommend using it until it had been in production for a few months, and we can't even say that of 4.0 yet.