Hi,
I've been using the existing MediaWiki search engine and have implemented a doc file search based on the filesearch extension (running each doc through antiword). I realize that Wikipedia now uses Lucene, but I have some suggestions to improve the MySQL search.
First off, I noticed that the maintenance script rebuildTextIndexes.php has a bug: it doesn't index any namespace other than main. It also needs text on the page, so I made the following hack (line 59):
$u = new SearchUpdate( $s->page_id, Title::makeName( $s->page_namespace, $s->page_title ), $revtext );
// Always have some text for images to force indexing
if ( $u->mNamespace == NS_IMAGE && !$u->mText ) {
    $u->mText = "File";
}
This allows it to index files with no text, and ensures the namespace is taken into account.
Also, the MySQL ranking is not working at the moment:
$m2 = str_replace(" IN BOOLEAN MODE", "", $match);
$m2 = str_replace("+", "", $m2);
SELECT page_id, page_namespace, page_title, {$m2} AS relevance FROM $page, $searchindex WHERE page_id=si_masterid AND $match
I've replaced the stock query with a hacked multi-wiki one that shows the rank, so I hope that makes sense!
This tip comes from the second comment on http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Both might be nice additions to the core engine, for people without Lucene...
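In case it helps, here is a rough standalone sketch of the string handling behind that relevance expression. The helper name relevanceExpression(), the sample search terms and the literal table names are only illustrative assumptions, not anything in MediaWiki itself, and the ORDER BY is something I'd expect to be wanted rather than something I've tested in core:

```php
<?php
// Hypothetical helper (not part of MediaWiki): turn a boolean-mode MATCH
// clause into a natural-language one, which MySQL evaluates to a score.
function relevanceExpression( $match ) {
    // Remove the boolean-mode modifier
    $m2 = str_replace( " IN BOOLEAN MODE", "", $match );
    // Remove the "+" (required term) operators
    return str_replace( "+", "", $m2 );
}

// Example boolean-mode clause; the search terms are made up for illustration
$match = "MATCH(si_text) AGAINST('+wiki +search' IN BOOLEAN MODE)";
$m2 = relevanceExpression( $match );

// Table names are written out literally here; in the real code they come
// from $page and $searchindex. ORDER BY is my assumption.
$sql = "SELECT page_id, page_namespace, page_title, {$m2} AS relevance " .
    "FROM page, searchindex WHERE page_id=si_masterid AND $match " .
    "ORDER BY relevance DESC";

echo $m2, "\n";
echo $sql, "\n";
```

The boolean-mode clause stays in the WHERE so the filtering is unchanged; only the selected relevance column uses the stripped version. Note that this also strips any "+" that is part of a search term itself.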
The final fix was to the filesearch extension: it should return true, or subsequent indexing extensions break.
One last question: when updating the index, should I hook ondeletepage to remove an index entry, or should I use another hook somewhere else?
Best regards,
Alex