Our search engine desperately needs retooling. If there's no objection from those in the know, I'd like to migrate us to MySQL 4. The fulltext search in 4 has boolean capabilities built right in, meaning we could remove our hackish and buggy parser, and wouldn't need to stack so many MATCHes together in a query when some poor sap types in "chemical composition of the earth's atmosphere oxygen nitrogen" or something.
(Our search queries are also frequently *dog slow*. This is exacerbated because, being a MyISAM table, it locks when someone tries to write to it and another read is pending. I don't _think_ this lock virulently spreads to other tables joined with it, but it's annoying anyway.)
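For illustration, here is a minimal Python sketch of how a user's search string might be turned into one MySQL 4 boolean-mode query instead of a stack of MATCHes. The `MATCH ... AGAINST (... IN BOOLEAN MODE)` syntax is MySQL's own; the column names `si_page`, `si_title`, and `si_text`, the helper name `build_boolean_query`, and the choice to require every term are assumptions made for the example, and it presumes a FULLTEXT index covering exactly the listed columns.

```python
import re

# Characters that MySQL treats as operators in boolean mode.
OPERATORS = r'[+\-<>()~*"@]+'


def build_boolean_query(user_input, table="searchindex",
                        columns=("si_title", "si_text")):
    """Return (sql, params) for a boolean-mode fulltext search.

    Table and column names are hypothetical placeholders, not user input.
    """
    # Keep balanced "quoted phrases" together; split the rest on whitespace.
    tokens = re.findall(r'"[^"]+"|\S+', user_input)
    terms = []
    for token in tokens:
        if len(token) > 2 and token.startswith('"') and token.endswith('"'):
            terms.append("+" + token)  # require the exact phrase
        else:
            # Neutralize operator characters typed by the user, then
            # require each remaining word.
            for word in re.split(OPERATORS, token):
                if word:
                    terms.append("+" + word)
    sql = ("SELECT si_page FROM {} WHERE MATCH({}) "
           "AGAINST (%s IN BOOLEAN MODE)").format(table, ", ".join(columns))
    # %s is the DB-API placeholder style used by common MySQL drivers;
    # the search expression itself is passed as a bound parameter.
    return sql, (" ".join(terms),)
```

The long example query from above would then become a single expression of the form `+chemical +composition +of ...`, leaving the boolean logic to the server rather than to our own parser.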
Other things to think about:
* Stopwords. Can we just get rid of the damn stopwords and search anything?
* "Title results" vs "Text results" - this two-prong approach is, I think, rather confusing. We could have a single search index field with the title text weighted more heavily (by repetition?), and just give a single set of results.
* Text extracts: these show the raw wikicode, and often include language links, HTML code, etc. Yuck! If we can strip these, that might be good.
* Character entities: should be folded to their raw equivalents in the search index, so searching a page containing "Schr&ouml;dinger" and one containing "Schrödinger" gives identical results.
* 'Power search' is perhaps a little confusing, and there's currently no way to get to it short of doing two searches.
* 'Search' and 'go' buttons are not clearly demarcated; several people have noted confusion. Better labelling or better arrangement is needed.
* Redirects. We generally want to filter out redirects that seem duplicative of other things already listed, but *must* show them for alternate names. Clearer labeling of redirects would help as well.
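Two of the points above, entity folding and weighting the title by repetition, can be sketched in a few lines of Python. This is only a rough illustration: the function name `build_index_text` and the default weight of 3 are made up for the example, and whether repetition actually shifts fulltext relevance in a useful way would need to be measured.

```python
import html


def build_index_text(title, body, title_weight=3):
    """Build one search-index string from a page title and body.

    Hypothetical helper: entities are folded to raw characters, and the
    title is repeated title_weight times so that title matches count for
    more than body matches.
    """
    # html.unescape turns "&ouml;" (and numeric forms like "&#246;")
    # into the character they stand for.
    title = html.unescape(title)
    body = html.unescape(body)
    return " ".join([title] * title_weight + [body])
```

With a single combined field like this, one query could return a single ranked result set instead of separate title and text results.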
-- brion vibber (brion @ pobox.com)
On Wed, 2003-01-29 at 10:19, Brion Vibber wrote:
> Our search engine desperately needs retooling. If there's no objection from those in the know, I'd like to migrate us to MySQL 4. The fulltext search in 4 has boolean capabilities built right in, meaning we could remove our hackish and buggy parser, and wouldn't need to stack so many MATCHes together in a query when some poor sap types in "chemical composition of the earth's atmosphere oxygen nitrogen" or something.
As a tempfix, we could match against 'phrase' for any phrase that doesn't contain OR or NOT, no?
> (Our search queries are also frequently *dog slow*. This is exacerbated because, being a MyISAM table, it locks when someone tries to write to it and another read is pending. I don't _think_ this lock virulently spreads to other tables joined with it, but it's annoying anyway.)
If Jimbo has some money to spend, he should give it to InnoDB and ask them to implement the damn FULLTEXT index: http://www.innodb.com/todo.html
Failing that, we might think about delaying index updates. Ugly, though. Also, split up the join as we discussed. If we're really freaky, we could move the searchindex to a separate PostgreSQL database, perhaps as part of the phase IV (or was it V?) transition.
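To make the "delaying index updates" idea a bit more concrete, here is a rough Python sketch: edits are buffered in memory and written to the search index in batches, so the table is locked for writing less often. Everything here (the class name `DeferredIndexUpdater`, the `write_batch` callback, the threshold of 50) is hypothetical, and it ignores the ugly parts such as losing the buffer on a crash.

```python
class DeferredIndexUpdater:
    """Buffer search-index updates and write them out in batches."""

    def __init__(self, write_batch, max_pending=50):
        # write_batch is a caller-supplied function that takes a dict of
        # {page_id: text} and performs the actual index writes.
        self.write_batch = write_batch
        self.max_pending = max_pending
        self.pending = {}

    def queue(self, page_id, text):
        # A later edit to the same page replaces the earlier one, so each
        # page is written at most once per batch.
        self.pending[page_id] = text
        if len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, {}
        self.write_batch(batch)
```

The trade-off is that freshly saved pages would not be searchable until the next flush.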
> - Stopwords. Can we just get rid of the damn stopwords and search anything?
Absolutely in favor!
> - "Title results" vs "Text results" - this two-prong approach is, I think, rather confusing. We could have a single search index field with the title text weighted more heavily (by repetition?), and just give a single set of results.
Not sure, I always liked the distinction. Has anyone complained about this?
> - Text extracts: these show the raw wikicode, and often include language links, HTML code, etc. Yuck! If we can strip these, that might be good.
Yes!
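A crude way to clean up the extracts could look like the Python sketch below. It is regex-based and only handles the cases named above (HTML tags, interlanguage links, ordinary links, bold/italic quotes); the function name `plain_extract` and the two-or-three-letter test for language prefixes are assumptions for the example, not a description of the real parser.

```python
import re


def plain_extract(wikitext, max_len=200):
    """Return a plain-text snippet for search results (heuristic only)."""
    text = re.sub(r"<[^>]+>", "", wikitext)  # drop HTML tags
    # Drop interlanguage links such as [[de:Sauerstoff]]; the prefix test
    # is a guess based on two- or three-letter language codes.
    text = re.sub(r"\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]*\]\]", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic markers
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text[:max_len]
```

Since the stripping happens only when results are displayed, the stored wikitext would stay untouched.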
> - Character entities: should be folded to their raw equivalents in the search index, so searching a page containing "Schr&ouml;dinger" and one containing "Schrödinger" gives identical results.
Right.
> - 'Power search' is perhaps a little confusing, and there's currently no way to get to it short of doing two searches.
> - 'Search' and 'go' buttons are not clearly demarcated; several people have noted confusion. Better labelling or better arrangement is needed.
I'm afraid that in the limited space we have, we can't really do much better. "Go" is fairly obvious when you use it, and I can't think of a better label. With the new matching (namespace handling could be improved), it's really darn useful.
We might want to add a small "Advanced search" link below, in another column of the row where the interlanguage links are shown.
> - Redirects. We generally want to filter out redirects that seem duplicative of other things already listed, but *must* show them for alternate names. Clearer labeling of redirects would help as well.
Well, I thought about a syntax like
#redirect [[foo]] (reason)
We could then show this nicely in the search results as
"Redirects to page foo. Reason: spelling error."
Also, on the actual page
"Redirected from bar. Reason: spelling error."
However, by allowing freetext here, we will get lots of different non-standardized texts, which is bad. I'd rather have some standard texts defined in Language.php and have these referenced with shorthands like

  "sp"   - spelling error
  "old"  - older spelling
  "tra"  - naming convention: anglicization/transliteration
  "acr"  - naming convention: acronyms
  "plu"  - naming convention: pluralization
  "com"  - naming convention: common name
  "nam"  - naming convention: names and titles
  "sty"  - naming conventions - style, general
  "dis"  - disambiguation
  "ndis" - unique title, no disambiguation needed
These labels should always be as specific as possible, i.e. not just "alternative title", but refer to the correct naming convention. The texts could, in fact, link to the proper Wikipedia articles. This would help readers understand why we are redirecting where, expose more people to our policies, and allow better presentation of search results. We could define for each of them whether they should be included in the search or not (I think that "nam" and "dis" should not be included.)
These labels would not be hardcoded anywhere but in LanguageXY.php, i.e. they would not be auto-inherited by other languages. So every Wikipedia could set its own policies and shortcuts.
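As a rough sketch of how the shorthand lookup might work, here it is in Python (the real thing would live in the PHP language files). The label texts are the ones listed above; the regular expression, the function names, and the rule that only "nam" and "dis" are left out of search are assumptions for illustration.

```python
import re

# Shorthand -> standard text, per language (hypothetical English table).
REDIRECT_REASONS = {
    "sp": "spelling error",
    "old": "older spelling",
    "tra": "naming convention: anglicization/transliteration",
    "acr": "naming convention: acronyms",
    "plu": "naming convention: pluralization",
    "com": "naming convention: common name",
    "nam": "naming convention: names and titles",
    "sty": "naming conventions - style, general",
    "dis": "disambiguation",
    "ndis": "unique title, no disambiguation needed",
}
# Assumed policy: these redirect types are hidden from search results.
EXCLUDED_FROM_SEARCH = {"nam", "dis"}

REDIRECT_RE = re.compile(r"^#redirect\s*\[\[([^\]]+)\]\]\s*(?:\((\w+)\))?",
                         re.IGNORECASE)


def parse_redirect(wikitext):
    """Return (target, code) for a redirect page, or None otherwise."""
    m = REDIRECT_RE.match(wikitext.strip())
    if not m:
        return None
    code = m.group(2).lower() if m.group(2) else None
    return m.group(1), code


def describe_redirect(wikitext):
    """Build the line shown in search results for a redirect."""
    parsed = parse_redirect(wikitext)
    if parsed is None:
        return None
    target, code = parsed
    reason = REDIRECT_REASONS.get(code)
    if reason is None:
        return "Redirects to page {}.".format(target)
    return "Redirects to page {}. Reason: {}.".format(target, reason)


def show_in_search(wikitext):
    """True if this redirect should be listed in search results."""
    parsed = parse_redirect(wikitext)
    return parsed is not None and parsed[1] not in EXCLUDED_FROM_SEARCH
```

Unknown or missing shorthands simply fall back to the plain "Redirects to page foo." text, so existing redirects would keep working.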
Regards,
Erik
Brion Vibber wrote:
> Our search engine desperately needs retooling.
This is a welcome initiative.
> Other things to think about:
> - Stopwords. Can we just get rid of the damn stopwords and search anything?
A very few may still need to be there, but with the opportunity to override.
> - "Title results" vs "Text results" - this two-prong approach is, I think, rather confusing. We could have a single search index field with the title text weighted more heavily (by repetition?), and just give a single set of results.
I believe in options. Perhaps a checkbox if one only wants to look for titles. A 'titles only' search will naturally be much faster, and may be all that is needed.
> - Text extracts: these show the raw wikicode, and often include language links, HTML code, etc. Yuck! If we can strip these, that might be good.
For the general search I agree. Still an opt-in to all that is very helpful when we are looking for things to edit.
> - Character entities: should be folded to their raw equivalents in the search index, so searching a page containing "Schr&ouml;dinger" and one containing "Schrödinger" gives identical results.
Also "Schrodinger" without an umlaut, etc.
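Folding away the diacritics as well can be sketched with Python's standard `unicodedata` module: decompose each character, then drop the combining marks. The function name `fold_accents` is made up for the example, and note that this gives "Schrodinger", not the German-style "Schroedinger", which would need a separate mapping.

```python
import unicodedata


def fold_accents(text):
    """Strip diacritics so accented and unaccented spellings index alike."""
    # NFKD splits a character like "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applying the same folding to both the index text and the search terms would make all three spellings hit the same entries.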
> - 'Power search' is perhaps a little confusing, and there's currently no way to get to it short of doing two searches.
I guess I'm just one of those luddites that's never distinguished between a search and a power search.
> - 'Search' and 'go' buttons are not clearly demarcated; several people have noted confusion. Better labelling or better arrangement is needed.
> - Redirects. We generally want to filter out redirects that seem duplicative of other things already listed, but *must* show them for alternate names. Clearer labeling of redirects would help as well.
See my answer to 3.
Eclecticology
wikitech-l@lists.wikimedia.org