I'm just starting to implement search in the new codebase--the last
major piece of the puzzle that's still missing.
I just found another annoyance and I don't know if there's a
workaround or not. MySQL treats the ' character as part of a
word. It's odd that it's otherwise so restrictive but allows
that one--only letters, digits, underscore, and '. The upshot
of this in wikitext is that if the only appearance of a word
in an article is in bold or italics, i.e., ''like this'', MySQL
will index "'''word'''" but not "word", and so a search for
"word" will fail. So, ironically, it throws away references which we
have specifically emphasized.
Is there a way to change this behavior of MySQL, or is this one
more reason to give up and preprocess the whole text of each
article like we do for cur_ind_title?
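For what it's worth, the preprocessing route might look roughly like the sketch below. It is only an illustration: the function names are made up, and the tokenizer is just an approximation of the word rule described above (letters, digits, underscore, and '), not MySQL's actual code.

```python
import re

# Runs of two or more apostrophes are wiki emphasis markup
# ('' for italics, ''' for bold). A single apostrophe is left alone.
EMPHASIS = re.compile(r"'{2,}")

def strip_emphasis(wikitext):
    """Remove emphasis quotes so the indexer sees the bare word."""
    return EMPHASIS.sub("", wikitext)

def index_words(wikitext):
    """Hypothetical tokenizer approximating the rule described in the post:
    letters, digits, underscore, and the apostrophe are word characters."""
    return re.findall(r"[\w']+", strip_emphasis(wikitext).lower())
```

With this, "'''word'''" is indexed as "word", while an apostrophe inside a word such as "don't" survives untouched.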
My opinion is that we'd be better off "rolling our own" as I did with
the old perl software, using dbm files and updating the search engine
The nice thing about using MySQL as the search engine is that we get it
"for free". The data is in the database anyway, so why not just let MySQL
But the downside is that we have to more or less accept all the quirks
(or lack of quirks!) that come with the MySQL behavior.
If we roll our own, then we get to preprocess the text in any way we
like, plus we get to *score* the results in any way we like. We can
do neat things like taking care of the wiki syntax optimally. We can
do things (in theory) like scoring an article slightly higher as a
search result if lots of other articles link to it. We can give higher
points to italicized words, maybe.
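A rough sketch of what such scoring could look like, assuming we already know how many other articles link to the one being scored. The function name, the weights, and the formula are all invented for illustration and would need tuning against real searches.

```python
import math
import re

# Text between runs of two or more apostrophes is treated as emphasized.
EMPHASIZED = re.compile(r"'{2,}(.+?)'{2,}")
WORD = re.compile(r"\w+")

def score(wikitext, term, inbound_links, emphasis_bonus=0.5, link_weight=0.1):
    """Hypothetical relevance score: occurrences of the term, plus a bonus
    for emphasized occurrences, scaled up slightly by inbound link count."""
    term = term.lower()
    plain = re.sub(r"'{2,}", "", wikitext).lower()
    hits = WORD.findall(plain).count(term)
    if hits == 0:
        return 0.0
    emphasized_hits = sum(
        WORD.findall(span.lower()).count(term)
        for span in EMPHASIZED.findall(wikitext)
    )
    # log1p keeps heavily linked articles from swamping everything else.
    link_factor = 1 + link_weight * math.log1p(inbound_links)
    return (hits + emphasis_bonus * emphasized_hits) * link_factor
```

Articles that never mention the term score zero regardless of how many links point at them, so links only reorder genuine matches.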
We can choose to handle singulars and plurals in any way we prefer.
All of these things will be a more or less empirical matter, but my
own experience suggests that some tiny clever tricks can massively
improve the relevance to the end user.
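As one example of such a trick, singulars and plurals could be folded together before indexing and before looking up a query. The rules below are a deliberately naive assumption, not a real stemmer, and they will mishandle irregular words; the function name is hypothetical.

```python
def normalize_plural(word):
    """Naive, hypothetical plural folding for English index terms."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"          # categories -> category
    if len(w) > 3 and w.endswith(("ches", "shes", "sses", "xes", "zes")):
        return w[:-2]                # boxes -> box, churches -> church
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]                # articles -> article
    return w
```

Applying the same function to both the indexed words and the search terms means a search for "article" also finds pages that only say "articles".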