On Thu, May 23, 2002 at 09:33:21PM +0200, Jan.Hidders wrote:
PS. I'd still favour a move to PostgreSQL though, which would probably be a cleaner solution and has also some other benifits. But before we do that I would like to be very very sure that fulltext indexing (which is now I believe the only MySQL-specific thing we use) works well there.
An interesting alternative might be to do the indexing ourselves. This may sound very inefficient but let me explain. One method/trick that was used in PgSQL to get a fulltext index was to create someting like a separate table
text_index( search_word, document_id )
with a conventional index on the column search_word. Now, every time a row is added to the table that you want to be searchable you retrieve all the search words (and their suffixes) and add the corresponding pairs to the table text_index. So if the text in document with id '5' is "Text Indexing" you add the following pairs to table text_index:
( "text", 5 ) ( "ext", 5 ) ( "xt", 5 ) ( "indexing", 5 ) ( "ndexing", 5 ) ...
If you now want to search for an article containing "dex" you can query the text_index table and use in your WHERE clause something like
"search_word LIKE 'dex%'"
In this case the index for the search_word column is used because the LIKE expression doesn't start with a wild-card. As you can see this will also give you the partial matches. In PgSQL this is easy to implement because you can define triggers that call certain functions whenever a row is inserted or deleted in a table. But also in MySQL we could do this, although we would then have to ensure ourselves that every time an article is created/updated/deleted the text_index table is also updated.
The nice part is that we would have total control over what is indexed and how it is indexed in the form of a PHP function that determines which pairs are added to text_index. This function could take care of collation and could even do nifty things like taking entity references into account or let a-umlaut match with ae. The different Wikipedias would probably all have a localized version of this function.
The down-side is of course some performance when articles are updates and the extra work we need to do on writing this PHP function (although initial version could be quite simple).
-- Jan Hidders