> Having said that I now favour removing my search code and moving
> to MySQL's boolean search, because if you don't like its default
> scoring you can now use the +'s. It was fun writing a parser for
> boolean expressions but if we can get rid of that complicated
> piece of code and defer some of the work to the database I'm all
> for it. Simplify, simplify.
The first simplification I made was to get rid of the parser because
it wasn't necessary--SQL is already doing it, so I just pass on the
ANDs, ORs, and NOTs as they are. Yes, I put an implicit AND between
terms, because that makes for small result sets and fast searches.
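As a rough illustration of the idea (not the code in the new codebase), the sketch below turns a query string into a WHERE clause by passing AND, OR, and NOT through and inserting an implicit AND between adjacent terms. The column name `si_text`, the function name, and the `%s` placeholder style are assumptions made for the example; parentheses are not handled.

```python
import re

# Hypothetical column name; the real schema is not shown in this message.
INDEX_COLUMN = "si_text"
OPERATORS = {"AND", "OR", "NOT"}


def build_where(query, column=INDEX_COLUMN):
    """Return (where_clause, params) for a query like 'foo bar OR baz'."""
    clause = []
    params = []
    expect_operand = True  # True at the start and right after an operator
    for token in re.findall(r"\w+", query):
        upper = token.upper()
        if upper in OPERATORS:
            if upper == "NOT":
                if not expect_operand:
                    clause.append("AND")  # 'foo NOT bar' means 'foo AND NOT bar'
                clause.append("NOT")
                expect_operand = True
            elif not expect_operand:
                clause.append(upper)      # explicit AND / OR passed through
                expect_operand = True
            # A leading or doubled AND/OR is silently dropped.
            continue
        if not expect_operand:
            clause.append("AND")          # the implicit AND between terms
        clause.append(f"MATCH({column}) AGAINST (%s)")
        params.append(token.lower())
        expect_operand = False
    # Drop operators left dangling at the end of the query.
    while clause and clause[-1] in OPERATORS:
        clause.pop()
    return " ".join(clause), params
```

The search terms travel separately as parameters, so the only text copied into the SQL string is the fixed set of operators.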
The boolean searching in MySQL 4.0 would be great--but that's a
BIG leap--MySQL 4.0 is not a stable product. It's alpha software,
and I'm not so sure that giving up the reliability of 3.23 is worth
the extra features. Does anyone on the list have experience with
MySQL 4.0 in a production environment?
MySQL 3.23 is very stable and reliable. Even recompiling it from
source was simple (I did that to get rid of the 4-letter minimum--
you can check that out at the new site--search for "PVC" for example).
The second change I made to the search was to parse the article
text into a separate field the way we were already doing for titles.
This field contains all the unique words of the article just once,
case folded and stripped of punctuation (so it fixes the ''
problem, for example). I even do some processing for things
like [[game]]s, which will put both "game" and "games" in the index.
We could expand this preprocessing to handle more cases like that.
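A minimal sketch of that kind of preprocessing, assuming ASCII text only and using a made-up function name (the real code lives in the PHP codebase and may differ in detail):

```python
import re


def index_words(wikitext):
    """Return the unique, case-folded words of an article, sorted."""
    text = wikitext.lower()
    # [[game]]s -> "game games": keep both the link target and the
    # form with the trailing letters so that either one matches.
    text = re.sub(r"\[\[([^\]|]+)\]\]([a-z]+)", r"\1 \1\2", text)
    # Anything that is not a letter or digit acts as a separator,
    # which also discards quote markup such as ''.
    words = re.findall(r"[a-z0-9]+", text)
    return sorted(set(words))
```

For example, `index_words("''Chess'' is one of many [[game]]s.")` yields both `game` and `games`, and `chess` appears without the surrounding quote markup.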
We could also do our own scoring after MySQL returns the raw
results, but that would require making a pass through the entire
result set before displaying anything. Another thing about the
search in the new codebase is that it is blindingly fast--it often
returns results within 2 seconds. When it's that fast, you don't
need as many features because the user can do multiple searches.
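To make the cost of our own scoring concrete, here is a small sketch with invented weights (2 for a title hit, 1 for a body hit) and a hypothetical row layout of (title, indexed words). Nothing can be shown until the loop has visited every row, because the best match might be the last one returned.

```python
def rescore(rows, terms):
    """Order result titles by a home-made score.

    rows  -- iterable of (title, indexed_words) pairs from the database
    terms -- lowercase query words
    """
    scored = []
    for title, words in rows:  # one full pass over the result set
        title_words = set(title.lower().split())
        body_words = set(words.split())
        score = 0
        for term in terms:
            if term in title_words:
                score += 2     # assumed weight for a title match
            elif term in body_words:
                score += 1     # assumed weight for a body match
        scored.append((score, title))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]
```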
> However, in the long run we should probably implement our own
> indexing. That would allow us to tackle several problems:
> - the ' problem
> - searching UTF-8 with proper collation without hacking the
> character set
> - recognizing entities such as &ouml;
> - languages with inflections
All of these can be solved with the pre-processing already in
the new codebase--in fact the ' problem is already solved. I
haven't done anything with new character sets, but that should
be pretty easy--take a look at SearchUpdate.php.
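As a hedged example of how the same preprocessing step could cope with entities and accented characters, the sketch below decodes HTML entities and then strips combining marks before the words are indexed. Folding accents away is only one possible policy (a German index might prefer to map the o-umlaut to "oe"), and the function name is made up for the example.

```python
import html
import unicodedata


def fold_text(text):
    """Decode HTML entities, strip accents, and case-fold the result."""
    text = html.unescape(text)  # turns a named entity into its character
    decomposed = unicodedata.normalize("NFKD", text)
    # After decomposition an accented letter is a base letter followed
    # by combining marks; dropping the marks leaves the base letter.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

Applying the same folding to the query terms means the entity form, the accented form, and the plain form of a word all land on the same index entry.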
> - partial matches or ... we could wait for the MySQL team to
> implement the Generic user-suppliable UDF preparser as
> mentioned in their to-do list. Perhaps we should give them a
> call. :-)
I'm really big on stable, reliable software. Even if MySQL
chose to implement something like that, I wouldn't recommend
using it until it had been in production for a few months, and
we can't even say that of 4.0 yet.