Steve Bennett wrote:
> On 11/8/06, Tim Starling <tstarling(a)wikimedia.org> wrote:
>> We've now dedicated a server to search index updates. I've got them
>> running in two threads, one for enwiki and one for everything else.
>> Each one runs in a loop; we should get a complete index update once
>> every 30 hours or so.
>
> Would now be a good time to ask questions about what the search index
> algorithm is and possible improvements to it? It works okay, but
> sometimes the bleedingly obvious match isn't even on the first page.
>
> Example: Search for "tchaikovsky's piano concerto" (before I put the
> redirect in). The correct match "Piano Concerto No. 1 (Tchaikovsky)"
> is miles down the page, even though it has the three search terms
> (minus an 's) in the title. Search still seems to miss pages where the
> title is the *exact* search term, too.
>
> "Sounds-like" searches would be nice too - very often you look up
> stuff that you don't know how to spell, but may get the pronunciation
> close.
>
> Accentless searches would be great, too. It gets really tedious making
> redirects like "Spisska Kapitula -> Spišská Kapitula".
>
> And of course, resolving this whole distinction between "searches" and
> "gotos" would be nice :)
Yes, this is a good time to ask about it. I've been thinking about this
myself for the last few days, and I'm glad to hear a different perspective
on it.
The main question in my mind is what to index. Currently, there's a function
in the updater called StripWiki(), which appears to be based on the wikitext
stripping function from the MediaWiki core. It strips out things like
tables, the target half of piped links, a few random HTML tags and lots of
punctuation. I suspect this whole process is entirely unnecessary. I'm
considering removing StripWiki altogether, and just indexing the raw
wikitext. The tokenizer will strip punctuation characters by itself. There's
lots of useful data in tables; I don't see why you would want to stop people
from searching in them. And stripping out everything from [[Image:]] tags
except for the caption seems like a waste of time.
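As a rough illustration of why StripWiki() may be redundant (a minimal Python sketch, standing in for Lucene's actual analyzer, not any real MediaWiki code): a plain word tokenizer already discards table syntax, link brackets and punctuation on its own, leaving cell text and both halves of a piped link searchable. It also shows the flip side: markup words like "class" and "wikitable" become index terms.

```python
import re

def tokenize(text):
    """Minimal word tokenizer: lowercase alphanumeric runs; all
    punctuation and bracket syntax is simply dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

wikitext = ('{| class="wikitable"\n'
            '| [[Piano Concerto No. 1 (Tchaikovsky)|the concerto]]\n'
            '|}')
tokens = tokenize(wikitext)
# Table markup and brackets vanish; the link target, the link label
# and the cell text all survive as searchable terms -- but so do the
# attribute words "class" and "wikitable".
print(tokens)
```

Note that this is exactly the trade-off discussed here: indexing raw wikitext is cheap and keeps table data, at the cost of indexing some syntax noise.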
There are two other benefits from getting rid of StripWiki: firstly, 45% of
index update time is currently spent in StripWiki(). This is partly due to
its inefficient implementation, but I don't want to have to rewrite it if
it's useless. The second benefit is that you can use Lucene to do hit
context highlighting. Currently, we have our own context highlighter in
MediaWiki, which doesn't match up at all well with what Lucene is actually
matching. It doesn't understand phrase searches, for instance.
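To illustrate the mismatch, here is a toy phrase-aware snippet highlighter (a hypothetical sketch, not MediaWiki's or Lucene's actual highlighter): it has to relocate the phrase in the text itself, and it only produces the right snippet if it matches text the same way the engine did. A highlighter driven by Lucene's own match positions avoids that duplication entirely.

```python
import re

def highlight(text, phrase, context=30):
    """Find a phrase match and return a short snippet with the hit
    wrapped in ''...'' quotes, search-result style. Returns None if
    the phrase doesn't occur -- which is exactly the failure mode of
    a highlighter that matches differently from the search engine."""
    m = re.search(re.escape(phrase), text, re.IGNORECASE)
    if m is None:
        return None
    start = max(0, m.start() - context)
    end = min(len(text), m.end() + context)
    snippet = text[start:end]
    return snippet.replace(m.group(0), "''" + m.group(0) + "''")

doc = "The Piano Concerto No. 1 by Tchaikovsky premiered in Boston in 1875."
print(highlight(doc, "piano concerto"))
```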
There are some potential problems, such as trying to find the word "span" in
a document without getting lots of hits from syntax, or phrase searching
across piped links.
Another quite different option is to index the HTML. This is potentially
more expensive, but avoids the complexity of reimplementing half the parser
to strip wikitext. It allows you to search documents with templates
expanded, and markup can easily be stripped out with a stock HTML filter.
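Stripping markup from rendered HTML really is routine. A minimal sketch using Python's stock HTML parser (standing in for whatever filter the Mono/Lucene indexer would actually use): the parser fires callbacks for text nodes, and concatenating them yields clean indexable text with templates already expanded.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML document,
    discarding all tags and attributes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once per text node between tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

html = '<p>See <a href="/wiki/Foo">the <b>Foo</b> article</a>.</p>'
extractor = TextExtractor()
extractor.feed(html)
print(extractor.text())  # See the Foo article.
```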
Accentless search should be easy, assuming Mono has an NFKD implementation
or something similar. I believe fuzzy search is similar, although I'm not as
familiar with the specific technology. Scoring problems (such as your
Tchaikovsky example) and missing hits should be dealt with as bugs, please
report them. Most complaints of missing hits in the past have been due to
slow index updates, so be careful to rule that out before you make a report.
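A minimal sketch of accent folding via NFKD (using Python's unicodedata here, on the assumption that Mono exposes something equivalent): decompose each character, then drop the combining marks, so "Spišská" and "Spisska" produce identical index terms and the redirect becomes unnecessary.

```python
import unicodedata

def fold_accents(s):
    """Decompose to NFKD, then discard combining marks, so accented
    and unaccented spellings normalize to the same string."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_accents("Spišská Kapitula"))  # Spisska Kapitula
```

Applied at both index time and query time, this makes accentless search transparent to the user.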
As for search and go, well, my preference for a long time has been that we
should remove the "search" button from the sidebar, to make it harder to
find and thus reduce the request rate. Go is much faster than search, if it
manages to find a title match without sending you on to the full text
search. Most readers don't know the difference, and if they did, they would
click "go" in the great majority of cases.
-- Tim Starling