I'm just starting to implement search in the new codebase--the last major piece of the puzzle that's still missing.
I just found another annoyance, and I don't know whether there's a workaround. MySQL treats the ' character as part of a word. That's odd, because it's otherwise so restrictive: the only word characters are letters, digits, underscore, and '. The upshot in wikitext is that if the only appearance of a word in an article is in bold or italics, i.e., ''like this'', MySQL will index "'''word'''" but not "word", and so a search for "word" will fail. So, ironically, it throws away exactly the references we have specifically emphasized.
Is there a way to change this behavior of MySQL, or is this one more reason to give up and pre-process the whole text of each article like we do for cur_ind_title?
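
For illustration, here is a rough sketch of what that kind of pre-processing might look like. It is written in Python rather than in the wiki's own code, and the names strip_wiki_quotes and index_words are made up for this example. The assumption is that any run of two or more apostrophes is wiki markup and can be replaced by a space, while a single apostrophe (as in "isn't") is left alone.

```python
import re

# Assumption: two or more consecutive apostrophes are always wiki
# bold/italic markup, never part of a real word.
_QUOTE_RUN = re.compile(r"'{2,}")

# Word characters as described above: letters, digits, underscore, and '.
_WORD = re.compile(r"[A-Za-z0-9_']+")


def strip_wiki_quotes(text):
    """Replace runs of '' / ''' / ''''' with a space before indexing."""
    return _QUOTE_RUN.sub(" ", text)


def index_words(text):
    """Return the lowercased words an indexer would see after stripping."""
    return _WORD.findall(strip_wiki_quotes(text).lower())


if __name__ == "__main__":
    print(index_words("The '''word''' isn't ''here''"))
```

The stripped text, not the raw wikitext, would then be what gets handed to the indexer, so a search for "word" finds the emphasized occurrence too.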
My opinion is that we'd be better off "rolling our own" as I did with the old Perl software, using dbm files and updating the search index nightly.
The nice thing about using MySQL as the search engine is that we get it "for free". The data is in the database anyway, so why not just let MySQL handle it?
But the downside is that we have to more or less accept all the quirks (or lack of quirks!) that come with the MySQL behavior.
If we roll our own, then we get to preprocess the text in any way we like, plus we get to *score* the results in any way we like. We can do neat things like taking care of the wiki syntax optimally. We can do things (in theory) like scoring an article slightly higher as a search result if lots of other articles link to it. We can give higher points to italicized words, maybe.
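
To make the scoring idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name score, the emphasis_bonus and link_weight parameters and their default values, and the idea that the caller already knows how many other articles link to this one. The real weights would have to be found by experiment.

```python
import re
from collections import Counter

# Text between runs of two or more apostrophes is treated as emphasized.
_EMPHASIS = re.compile(r"'{2,}(.+?)'{2,}")
# Apostrophes are deliberately not word characters here, so wiki quote
# markup never sticks to a word.
_WORD = re.compile(r"[a-z0-9_]+")


def score(term, wikitext, inbound_links, emphasis_bonus=1.0, link_weight=0.1):
    """Score one article for one search term (hypothetical weighting)."""
    term = term.lower()
    text = wikitext.lower()
    counts = Counter(_WORD.findall(text))
    if counts[term] == 0:
        return 0.0
    emphasized = Counter(
        word
        for span in _EMPHASIS.findall(text)
        for word in _WORD.findall(span)
    )
    # Every occurrence counts once; emphasized occurrences count extra.
    base = counts[term] + emphasis_bonus * emphasized[term]
    # Articles that many other articles link to get a modest boost.
    return base * (1.0 + link_weight * inbound_links)


if __name__ == "__main__":
    print(score("word", "A '''word''' and a word", inbound_links=10))
```

Multiplying by the link count is only one possible choice; an additive bonus or a logarithm of the count might behave better on heavily linked articles.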
We can choose to handle singulars and plurals in any way we prefer.
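
As one example of such a choice, here is a deliberately naive sketch (Python, with a made-up function name fold_plural) that folds common English plurals onto the singular before indexing and before searching. It is not a real stemmer and will get some words wrong (for instance "this" becomes "thi"); the point is only that the rules are ours to pick.

```python
def fold_plural(word):
    """Map a word to a crude singular form (naive, English only)."""
    w = word.lower()
    # categories -> category
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"
    # boxes -> box, buses -> bus
    if len(w) > 3 and w.endswith("es") and w[-3] in "sxz":
        return w[:-2]
    # articles -> article, but leave "class" alone
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


if __name__ == "__main__":
    for w in ["articles", "boxes", "categories", "class", "wiki"]:
        print(w, "->", fold_plural(w))
```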
All of these things will be a more or less empirical matter, but my own experience suggests that some tiny clever tricks can massively improve the relevance to the end user.
--Jimbo
wikitech-l@lists.wikimedia.org