Steve Bennett wrote:
> On 11/8/06, Tim Starling <tstarling@wikimedia.org> wrote:
>> We've now dedicated a server to search index updates. I've got them running in two threads, one for enwiki and one for everything else. Each one runs in a loop; we should get a complete index update once every 30 hours or so.
> Would now be a good time to ask questions about what the search index algorithm is and possible improvements to it? It works okay, but sometimes the bleedingly obvious match isn't even on the first page.
>
> Example: Search for "tchaikovsky's piano concerto" (before I put the redirect in). The correct match "Piano Concerto No. 1 (Tchaikovsky)" is miles down the page, even though it has the three search terms (minus an 's) in the title. Search still seems to miss pages where the title is the *exact* search term, too.
>
> "Sounds-like" searches would be nice too - very often you look up stuff that you don't know how to spell, but may get the pronunciation close.
>
> Accentless searches would be great, too. It gets really tedious making redirects like "Spisska Kapitula -> Spišská Kapitula".
>
> And of course, resolving this whole distinction between "searches" and "gotos" would be nice :)
Yes, this is a good time to ask about it. I've been thinking about this myself for the last few days, and I'm glad to hear a different perspective on it.
The main question in my mind is what to index. Currently, there's a function in the updater called StripWiki(), which appears to be based on the wikitext stripping function from the MediaWiki core. It strips out things like tables, the target half of piped links, a few random HTML tags and lots of punctuation. I suspect this whole process is entirely unnecessary. I'm considering removing StripWiki altogether, and just indexing the raw wikitext. The tokenizer will strip punctuation characters by itself. There's lots of useful data in tables, I don't see why you would want to stop people from searching in them. And stripping out everything from [[Image:]] tags except for the caption seems like a waste of time.
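To illustrate the point, here's a rough sketch against the plain Java Lucene API (illustrative only; our indexer runs under Mono, and the StandardAnalyzer choice is just an assumption for the example):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenizeWikitext {
        public static void main(String[] args) throws Exception {
            // Raw wikitext, markup and all; no StripWiki() pass first.
            String wikitext = "{| class=\"wikitable\"\n|-\n"
                + "| [[Piano Concerto No. 1 (Tchaikovsky)|the concerto]]\n|}";
            TokenStream stream = new StandardAnalyzer()
                .tokenStream("text", new StringReader(wikitext));
            // The tokenizer throws away punctuation like {|, [[ and ]] on
            // its own, so the table cell and both halves of the piped link
            // come out as ordinary searchable terms.
            for (Token t = stream.next(); t != null; t = stream.next()) {
                System.out.println(t.termText());
            }
        }
    }

The flip side is that markup words like "wikitable" get indexed too, which is part of the problem I mention below.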
There are two other benefits to getting rid of StripWiki(). First, 45% of index update time is currently spent in StripWiki(); that's partly due to its inefficient implementation, but I don't want to have to rewrite it if it's useless. Second, you can use Lucene to do hit context highlighting. Currently we have our own context highlighter in MediaWiki, which doesn't match up at all well with what Lucene is actually matching; it doesn't understand phrase searches, for instance.
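If the indexed text and the stored text are the same, the highlighting can come straight from Lucene's contrib highlighter, something like this (again a Java sketch under the same caveats; Highlighter and QueryScorer are in the contrib/highlighter package):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class HighlightExample {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            // The query parser understands phrase searches, unlike our
            // home-grown context highlighter in MediaWiki.
            Query query = new QueryParser("text", analyzer)
                .parse("\"piano concerto\"");
            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            String stored = "Tchaikovsky's Piano Concerto No. 1 premiered in 1875.";
            // Pick the best-scoring fragments; hits are wrapped in
            // <B>...</B> by the default formatter.
            System.out.println(highlighter.getBestFragments(
                analyzer.tokenStream("text", new StringReader(stored)),
                stored, 2, " ... "));
        }
    }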
There are some potential problems, such as finding the word "span" in a document without getting lots of spurious hits from <span> markup, or making phrase searches work across piped links.
Another quite different option is to index the HTML. This is potentially more expensive, but avoids the complexity of reimplementing half the parser to strip wikitext. It allows you to search documents with templates expanded, and markup can easily be stripped out with a stock HTML filter.
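Stripping the markup back out is then trivial; even a hand-rolled filter will do. A deliberately crude sketch:

    public class StripTags {
        // Crude stand-in for a "stock HTML filter": drop the tags, keep
        // the text. (A real filter would also decode entities like &amp;
        // and drop the contents of <script> and <style> elements.)
        static String stripTags(String html) {
            StringBuilder out = new StringBuilder(html.length());
            boolean inTag = false;
            for (int i = 0; i < html.length(); i++) {
                char c = html.charAt(i);
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                    out.append(' '); // keep adjacent words separate
                } else if (!inTag) {
                    out.append(c);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(stripTags(
                "<table><tr><td><a href=\"/wiki/Foo\">Foo</a> bar</td></tr></table>"));
        }
    }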
Accentless search should be easy, assuming Mono has an NFKD implementation or something similar. I believe fuzzy search is similar, although I'm not as familiar with the specific technology. Scoring problems (such as your Tchaikovsky example) and missing hits should be dealt with as bugs; please report them. Most complaints of missing hits in the past have been due to slow index updates, so be careful to rule that out before you make a report.
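For the accent folding, the usual trick is to decompose to NFKD and throw away the combining marks. In Java terms (purely illustrative, since the daemon runs under Mono rather than on a JVM; java.text.Normalizer is the JDK 6 class):

    import java.text.Normalizer;

    public class FoldAccents {
        // Decompose to NFKD, then strip combining marks (Unicode category M).
        static String fold(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "");
        }

        public static void main(String[] args) {
            // Prints "Spisska Kapitula", so the accentless query would
            // match without a manual redirect.
            System.out.println(fold("Spišská Kapitula"));
        }
    }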
As for search and go: well, my preference for a long time has been that we should remove the "search" button from the sidebar, to make it harder to find and thus reduce the request rate. Go is much faster than search, when it manages to find a title match without sending you on to the full text search. Most readers don't know the difference, and if they did, they would click "go" in the great majority of cases.
--
Tim Starling