Re: [Wikipedia-l] Searching in new codebase

12 Jun 2002

      ...
Having said that I now favour removing my search code and moving
to MySQL's boolean search, because if you don't like its default
scoring you can now use the +'s. It was fun writing a parser for
boolean expressions but if we can get rid of that complicated
piece of code and defer some of the work to the database I'm all
for it. Simplify, simplify.
The first simplification I did was to get rid of the parser because
it wasn't necessary--SQL is already doing it, so I just pass on the
ANDs, ORs, and NOTs as they are.  Yes, I put an implicit AND between
terms, because that makes fast, small result sets.
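
To make the implicit-AND idea concrete, here is a minimal sketch in Python (the codebase itself is PHP; this is only an illustration, not the code in the new codebase). The function names and the `si_text` column are hypothetical, and parenthesised sub-expressions are not handled.

```python
OPERATORS = {"AND", "OR", "NOT"}

def add_implicit_and(query):
    """Return the query tokens with AND inserted between adjacent terms."""
    out = []
    for token in query.split():
        upper = token.upper()
        is_op = upper in OPERATORS
        # A term (or a NOT) that directly follows another term gets an AND.
        if out and out[-1] not in OPERATORS and (not is_op or upper == "NOT"):
            out.append("AND")
        out.append(upper if is_op else token.lower())
    return out

def to_where_clause(tokens, column="si_text"):
    """Build a WHERE fragment with placeholders; `column` is a made-up name."""
    parts, params = [], []
    for tok in tokens:
        if tok in OPERATORS:
            parts.append(tok)  # operators are passed straight through to SQL
        else:
            parts.append(f"MATCH({column}) AGAINST(%s)")
            params.append(tok)
    return " ".join(parts), params

if __name__ == "__main__":
    print(to_where_clause(add_implicit_and("PVC pipe NOT copper")))
```

Because the operators go to the database untouched, no expression parser is needed on the application side; the only work is filling in the AND the user left out.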
The boolean searching in MySQL 4.0 would be great--but that's a
BIG leap--MySQL 4.0 is not a stable product.  It's alpha software,
and I'm not so sure that giving up the reliability of 3.23 is worth
the extra features.  Does anyone on the list have experience with
MySQL 4.0 in a production environment?
MySQL 3.23 is very stable and reliable.  Even recompiling it from
source was simple (I did that to get rid of the 4-letter minimum--
you can check that out at the new site--search for "PVC" for example).
The second change I made to the search was to parse the article
text into a separate field the way we were already doing for titles.
This field contains all the unique words of the article just once,
case folded and stripped of punctuation (so it fixes the ''
problem, for example).  I even do some processing for things
like [[game]]s, which will put both "game" and "games" in the index.
We could expand this preprocessing to handle more cases like that.
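
The preprocessing described above could look roughly like the following Python sketch. It is an assumption about the general shape, not a copy of SearchUpdate.php; the function name and the regular expressions are made up for illustration, and piped links such as [[a|b]]s are not given any special treatment.

```python
import re

# A wiki link immediately followed by a word suffix, e.g. [[game]]s
LINK_WITH_SUFFIX = re.compile(r"\[\[([^\]|]+)\]\](\w+)")

def build_index_text(wikitext):
    """Return the unique words of an article, case folded, without punctuation."""
    text = wikitext.casefold()
    # [[game]]s contributes both "game" and "games" to the index.
    text = LINK_WITH_SUFFIX.sub(
        lambda m: f"{m.group(1)} {m.group(1)}{m.group(2)}", text
    )
    # Anything that is not a letter or digit (quotes, brackets, commas)
    # becomes a space, which also removes the '' markup.
    text = re.sub(r"[\W_]+", " ", text)
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return " ".join(dict.fromkeys(text.split()))

if __name__ == "__main__":
    print(build_index_text("A ''[[game]]s'' console, a Game."))
```

Storing each word only once keeps the field small, at the cost of losing word frequency for any later scoring.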
We could also do our own scoring after MySQL returns the raw
results, but that would require making a pass through the entire
result set before displaying anything.  Another thing about the
search in the new codebase is that it is blindingly fast--it can
return results within 2 seconds in many cases.  When it's that
fast, you don't need as many features because the user can do
multiple searches.
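
For comparison, doing our own scoring would mean something like the Python sketch below. The weights and the function name are invented for illustration; the point is only that every row has to be read and scored before the first result can be shown.

```python
def rank_results(rows, terms, title_weight=3):
    """Sort (title, text) rows by a simple term-count score, best first.

    `title_weight` is an arbitrary illustrative value, not a tuned one.
    """
    terms = [t.lower() for t in terms]
    scored = []
    for title, text in rows:  # full pass over the result set
        title_words = title.lower().split()
        text_words = text.lower().split()
        score = sum(
            title_weight * title_words.count(t) + text_words.count(t)
            for t in terms
        )
        scored.append((score, title))
    # Sorting cannot start until every row has a score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]

if __name__ == "__main__":
    rows = [("Pipe", "pvc pvc"), ("PVC", "plastic")]
    print(rank_results(rows, ["pvc"]))
```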
...
However, in the long run we should probably implement our own
indexing. That would allow us to tackle several problems:
...

the ' problem
searching UTF-8 with proper collation without hacking the character set
recognizing entities such as ö
languages with inflections

All of these can be solved with the pre-processing already in
the new codebase--in fact the ' problem is already solved.  I
haven't done anything with new character sets, but that should
be pretty easy--take a look at SearchUpdate.php.
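
As a rough idea of how entities could be folded into the same preprocessing step, here is a hedged Python sketch. It is not what SearchUpdate.php does today; the function name is hypothetical, and real collation rules for a given language would need more than this.

```python
import html
import unicodedata

def normalize_for_index(text):
    """Decode HTML entities and normalize the result before indexing."""
    decoded = html.unescape(text)  # turns named and numeric entities into characters
    # NFC makes composed and decomposed forms of a character compare equal.
    return unicodedata.normalize("NFC", decoded).casefold()

if __name__ == "__main__":
    print(normalize_for_index("K&ouml;ln"))
```

With this in place, a word written with an entity and the same word written with the literal character end up as the same index term.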
...

partial matches or ... we could wait for the MySQL team to
implement the Generic user-suppliable UDF preparser as
mentioned in their to-do list. Perhaps we should give them a
call. :-)
I'm really big on stable, reliable software.  Even if MySQL
chose to implement something like that, I wouldn't recommend
using it until it had been in production for a few months, and
we can't even say that of 4.0 yet.