The search code in the new codebase behaves similarly to our current code: it assumes an implicit AND between search terms, and returns no results unless at least one article matches all of them.
I wonder if this is the intuitive behavior for most users. I think Google has conditioned people to type in as much relevant information as possible to get better hits, and most search engines work that way. In fact, the built-in mysql search code works that way too. Maybe we should use it directly? That way, we could also present the results according to relevancy (which mysql reports), rather than alphabetically.
We would lose the boolean AND OR NOT operators, but newer versions of mysql have substitutes: you use "+term" if you definitely want the term in your results, and you use "-term" if you definitely don't want it. This is almost as powerful as boolean searching.
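To make the "+term" / "-term" idea concrete, here is a minimal sketch of how a search string might be assembled and handed to MySQL's boolean full-text mode. The table name `articles`, the columns `title` and `body`, and the helper `build_boolean_query` are hypothetical, and a fulltext index on those columns is assumed to exist.

```python
def build_boolean_query(optional=(), required=(), excluded=()):
    """Combine terms into a MySQL boolean-mode search string.

    Required terms get a leading "+", excluded terms a leading "-",
    and optional terms are passed through unchanged.
    """
    parts = [f"+{term}" for term in required]
    parts += list(optional)
    parts += [f"-{term}" for term in excluded]
    return " ".join(parts)


# Hypothetical schema; the search string is passed as a bound parameter
# rather than being pasted into the SQL text.
SEARCH_SQL = (
    "SELECT title, MATCH (title, body) AGAINST (%s IN BOOLEAN MODE) AS score "
    "FROM articles "
    "WHERE MATCH (title, body) AGAINST (%s IN BOOLEAN MODE) "
    "ORDER BY score DESC"
)

query = build_boolean_query(
    optional=["encyclopedia"], required=["wiki"], excluded=["talk"]
)
params = (query, query)  # one value per %s placeholder
print(query)
```

Terms without a prefix stay optional, which is the "type in as much as possible" behavior described above.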
Alternatively, we could have an "advanced search" page where you could construct a boolean search, include/exclude specific namespaces etc. Now that I think about it, a way to optionally search talk: and wikipedia: would probably be desirable.
Axel
On Wed, Jun 12, 2002 at 04:53:17PM +0200, Axel Boldt wrote:
> I wonder if this is the intuitive behavior for most users. I think Google has conditioned people to type in as much relevant information as possible to get better hits, and most search engines work that way. In fact, the built-in mysql search code works that way too. Maybe we should use it directly?
I tried that at the time but wasn't really satisfied with the way MySQL did the scoring. If you looked for "A B" it would score an article with only lots of A's much higher than an article with a few combinations of A and B. That's why I came up with the idea of boolean search.
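The scoring complaint can be illustrated with a toy example. This is not MySQL's actual relevance formula; `naive_score` simply adds up term counts, which is enough to reproduce the effect of an article with many A's outranking one that contains both A and B. All names and sample texts here are made up for illustration.

```python
from collections import Counter


def naive_score(text, terms):
    """Toy relevance: total number of occurrences of any search term."""
    counts = Counter(text.lower().split())
    return sum(counts[term.lower()] for term in terms)


def matches_all(text, terms):
    """True if every search term occurs in the text (implicit AND)."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in terms)


articles = {
    "only_a": "a a a a a a",
    "a_and_b": "a b a b",
}
terms = ["a", "b"]

# Ranking by raw term counts puts the article without any B first.
ranked = sorted(
    articles, key=lambda name: naive_score(articles[name], terms), reverse=True
)

# Requiring every term (what "+a +b" would ask for) drops it entirely.
required = [name for name in ranked if matches_all(articles[name], terms)]

print(ranked)
print(required)
```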
Having said that, I now favour removing my search code and moving to MySQL's boolean search, because if you don't like its default scoring you can now use the +'s. It was fun writing a parser for boolean expressions, but if we can get rid of that complicated piece of code and defer some of the work to the database, I'm all for it. Simplify, simplify.
> That way, we could also present the results according to relevancy (which mysql reports), rather than alphabetically.
In my original code the sorting was based on the scoring. That's what happens in MySQL by default anyway.
> We would lose the boolean AND OR NOT operators, but newer versions of mysql have substitutes: you use "+term" if you definitely want the term in your results, and you use "-term" if you definitely don't want it. This is almost as powerful as boolean searching.
Not almost, completely: you can use brackets. :-)
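As a rough sketch of how brackets carry nested boolean expressions over to the +/- syntax, the function below renders a small expression tree as a boolean-mode string. The tuple representation and the name `render` are assumptions made for this example. NOT is only handled as an operand of AND, since a bare "-term" excludes rather than matches.

```python
def render(node, top=True):
    """Render a boolean expression tree as a MySQL boolean-mode string.

    A node is either a plain string (a search term) or a tuple:
    ("and", child, ...), ("or", child, ...) or ("not", child).
    """
    if isinstance(node, str):
        return node
    op, *children = node
    if op == "or":
        # Unprefixed items are optional: at least one of them should match.
        body = " ".join(render(child, top=False) for child in children)
    elif op == "and":
        parts = []
        for child in children:
            if not isinstance(child, str) and child[0] == "not":
                parts.append("-" + render(child[1], top=False))
            else:
                parts.append("+" + render(child, top=False))
        body = " ".join(parts)
    else:
        # A NOT outside an AND has no direct equivalent in this sketch.
        raise ValueError(f"unsupported operator here: {op}")
    # Nested groups need brackets; the outermost expression does not.
    return body if top else f"({body})"


# a AND (b OR c) AND NOT d
print(render(("and", "a", ("or", "b", "c"), ("not", "d"))))
```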
However, in the long run we should probably implement our own indexing. That would allow us to tackle several problems:
- the ' problem
- searching UTF-8 with proper collation without hacking the character set
- recognizing entities such as &ouml;
- languages with inflections
- partial matches
Or ... we could wait for the MySQL team to implement the "Generic user-suppliable UDF preparser" mentioned in their to-do list. Perhaps we should give them a call. :-)
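A home-grown indexer would need some kind of normalization step before words go into the index. The sketch below covers only a few of the points above (entities such as `&ouml;`, apostrophes inside words, case folding) and assumes that the apostrophe problem is about words being split at the apostrophe. The function name `index_tokens` is made up, and inflections, collation and partial matches are not addressed.

```python
import html
import re
import unicodedata


def index_tokens(text):
    """Turn raw article text into normalized tokens for a search index."""
    text = html.unescape(text)                 # "&ouml;" becomes "ö"
    text = unicodedata.normalize("NFC", text)  # one canonical form per character
    text = text.replace("\u2019", "'")         # curly apostrophe to straight
    text = text.casefold()                     # case-insensitive matching
    # Keep apostrophes that sit inside a word, so "o'brien" stays one token.
    return re.findall(r"\w+(?:'\w+)*", text)


print(index_tokens("K&ouml;ln's O'Brien"))
```

Applying the same function to the user's query before lookup would keep the index and the search terms consistent.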
-- Jan Hidders
wikipedia-l@lists.wikimedia.org