On Thu, May 23, 2002 at 09:33:21PM +0200, Jan.Hidders wrote:
PS. I'd still favour a move to PostgreSQL though, which would probably be a
cleaner solution and has also some other benifits. But before we do that I
would like to be very very sure that fulltext indexing (which is now I
believe the only MySQL-specific thing we use) works well there.
An interesting alternative might be to do the indexing ourselves. This may
sound very inefficient but let me explain. One method/trick that was used in
PgSQL to get a fulltext index was to create someting like a separate table
text_index( search_word, document_id )
with a conventional index on the column search_word. Now, every time a row
is added to the table that you want to be searchable you retrieve all the
search words (and their suffixes) and add the corresponding
pairs to the table text_index. So if the text in document with id '5' is
"Text Indexing" you add the following pairs to table text_index:
( "text", 5 )
( "ext", 5 )
( "xt", 5 )
( "indexing", 5 )
( "ndexing", 5 )
...
If you now want to search for an article containing "dex" you can query the
text_index table and use in your WHERE clause something like
"search_word LIKE 'dex%'"
In this case the index for the search_word column is used because the LIKE
expression doesn't start with a wild-card. As you can see this will also
give you the partial matches. In PgSQL this is easy to implement because you
can define triggers that call certain functions whenever a row is inserted
or deleted in a table. But also in MySQL we could do this, although we
would then have to ensure ourselves that every time an article is
created/updated/deleted the text_index table is also updated.
The nice part is that we would have total control over what is indexed and
how it is indexed in the form of a PHP function that determines which pairs
are added to text_index. This function could take care of collation and
could even do nifty things like taking entity references into account or let
a-umlaut match with ae. The different Wikipedias would probably all have a
localized version of this function.
The down-side is of course some performance when articles are updates and
the extra work we need to do on writing this PHP function (although initial
version could be quite simple).
-- Jan Hidders