We've now dedicated a server to search index updates. I've got them running in two threads, one for enwiki and one for everything else. Each one is a loop, we should get a complete index update once every 30 hours or so.
-- Tim Starling
Tim Starling wrote:
We've now dedicated a server to search index updates. I've got them running in two threads, one for enwiki and one for everything else. Each one is a loop, we should get a complete index update once every 30 hours or so.
Thanks, Tim!
-- brion vibber (brion @ pobox.com)
On 11/8/06, Tim Starling tstarling@wikimedia.org wrote:
We've now dedicated a server to search index updates. I've got them running in two threads, one for enwiki and one for everything else. Each one is a loop, we should get a complete index update once every 30 hours or so.
Would now be a good time to ask questions about what the search index algorithm is and possible improvements to it? It works okay, but sometimes the bleedingly obvious match isn't even on the first page.
Example: Search for "tchaikovsky's piano concerto" (before I put the redirect in). The correct match "Piano Concerto No. 1 (Tchaikovsky)" is miles down the page, even though it has the three search terms (minus an 's) in the title. Search still seems to miss pages where the title is the *exact* search term, too.
"Sounds-like" searches would be nice too - very often you look up stuff that you don't know how to spell, but may get the pronunciation close.
Accentless searches would be great, too. It gets really tedious making redirects like "Spisska Kapitula -> Spišská Kapitula".
And of course, resolving this whole distinction between "searches" and "gotos" would be nice :)
Steve
Steve Bennett wrote:
On 11/8/06, Tim Starling tstarling@wikimedia.org wrote:
We've now dedicated a server to search index updates. I've got them running in two threads, one for enwiki and one for everything else. Each one is a loop, we should get a complete index update once every 30 hours or so.
Would now be a good time to ask questions about what the search index algorithm is and possible improvements to it? It works okay, but sometimes the bleedingly obvious match isn't even on the first page.
Example: Search for "tchaikovsky's piano concerto" (before I put the redirect in). The correct match "Piano Concerto No. 1 (Tchaikovsky)" is miles down the page, even though it has the three search terms (minus an 's) in the title. Search still seems to miss pages where the title is the *exact* search term, too.
"Sounds-like" searches would be nice too - very often you look up stuff that you don't know how to spell, but may get the pronunciation close.
Accentless searches would be great, too. It gets really tedious making redirects like "Spisska Kapitula -> Spišská Kapitula".
And of course, resolving this whole distinction between "searches" and "gotos" would be nice :)
Yes, this is a good time to ask about it. I've been thinking about this myself for the last few days, and I'm glad to hear a different perspective on it.
The main question in my mind is what to index. Currently, there's a function in the updater called StripWiki(), which appears to be based on the wikitext stripping function from the MediaWiki core. It strips out things like tables, the target half of piped links, a few random HTML tags and lots of punctuation. I suspect this whole process is entirely unnecessary. I'm considering removing StripWiki() altogether, and just indexing the raw wikitext. The tokenizer will strip punctuation characters by itself. There's lots of useful data in tables, and I don't see why you would want to stop people from searching in them. And stripping out everything from [[Image:]] tags except for the caption seems like a waste of time.
There are two other benefits from getting rid of StripWiki(). Firstly, 45% of index update time is currently spent in StripWiki(). This is partly due to its inefficient implementation, but I don't want to have to rewrite it if it's useless. Secondly, you can use Lucene to do hit context highlighting. Currently, we have our own context highlighter in MediaWiki, which doesn't match up well at all with what Lucene is actually matching. It doesn't understand phrase searches, for instance.
There are some potential problems, such as trying to find the word "span" in a document without getting lots of hits from syntax, or phrase searching across piped links.
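A rough sketch of that trade-off, not the actual updater code: the tokenizer below is a hypothetical stand-in (a simple regular expression, not Lucene's analyzer), and the wikitext sample is made up for illustration. It shows table content becoming searchable, but also the word "span" being counted from markup and the target half of a piped link ending up next to its label.

```python
import re

# Hypothetical stand-in for a search tokenizer: runs of word characters, lowercased.
TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

# Made-up raw wikitext: a table cell with inline HTML, then a piped link.
raw = (
    '{| class="wikitable"\n'
    '| <span style="color:red">Bridge span</span>: 42 m\n'
    '|}\n'
    '[[Piano concerto|concerto]]'
)

tokens = tokenize(raw)

# Table data survives, which is the benefit of skipping the stripping step.
print("42" in tokens)

# "span" appears three times, but only one occurrence is article content.
print(tokens.count("span"))

# Both halves of the piped link are indexed back to back, which affects phrase search.
print(tokens[-3:])
```

Whether the extra markup tokens matter in practice would depend on how the real analyzer and scoring treat them, which this sketch does not model.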
Another quite different option is to index the HTML. This is potentially more expensive, but avoids the complexity of reimplementing half the parser to strip wikitext. It allows you to search documents with templates expanded, and markup can easily be stripped out with a stock HTML filter.
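As an illustration of stripping markup with a stock HTML filter, here is a minimal sketch using Python's standard html.parser module. The rendered HTML fragment, the TextExtractor class and the html_to_text helper are assumptions made up for this example; they are not part of MediaWiki or the updater.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attributes never arrive here.
        self.chunks.append(data)

def html_to_text(html):
    """Return the text content of html with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())

# Made-up rendered output for a table cell and a piped link.
rendered = (
    '<table><tr><td><span style="color:red">Bridge span</span>: 42 m</td></tr></table>'
    '<p><a href="/wiki/Piano_concerto">concerto</a> by Tchaikovsky</p>'
)

text = html_to_text(rendered)
print(text)
```

Here the markup "span" and the link target no longer reach the index, while the table data does. A real filter would also need to skip the contents of script and style elements, which this sketch does not do.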
Accentless search should be easy, assuming Mono has an NFKD implementation or something similar. I believe fuzzy search is similar, although I'm not as familiar with the specific technology. Scoring problems (such as your Tchaikovsky example) and missing hits should be dealt with as bugs, so please report them. Most complaints of missing hits in the past have been due to slow index updates, so be careful to rule that out before you make a report.
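The NFKD idea can be sketched as follows, in Python for brevity rather than anything that runs on Mono. The fold_accents name is hypothetical. The approach is to decompose each character into a base letter plus combining marks, then drop the marks, and to apply the same folding to both the indexed text and the query.

```python
import unicodedata

def fold_accents(text):
    """Decompose to NFKD, then drop combining marks such as carons and acutes."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("Spišská Kapitula"))  # prints "Spisska Kapitula"
```

This only handles letters that have a canonical or compatibility decomposition. Letters such as "ł" or "ø" are single code points without one, so they would need an extra mapping table.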
As for search and go, well my preference for a long time has been that we should remove the "search" button from the sidebar, to make it harder to find and thus reduce the request rate. Go is much faster than search, if it manages to find a title match without sending you on to the full text search. Most readers don't know the difference, and if they did, they would click "go" in the great majority of cases.
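The "go" behaviour described here can be sketched roughly as below. This is an assumption-laden illustration, not MediaWiki's implementation: the go function, the case-insensitive exact title match, and the full_text_search callback are all hypothetical.

```python
def go(query, titles, full_text_search):
    """Try a cheap title match first; fall back to full text search only on a miss.

    Returns ("goto", title) or ("search", results).
    """
    # Assumption: a title matches if it equals the query ignoring case and outer whitespace.
    by_folded = {title.casefold(): title for title in titles}
    hit = by_folded.get(query.strip().casefold())
    if hit is not None:
        return ("goto", hit)
    # Only now pay for the expensive full text search.
    return ("search", full_text_search(query))


# Usage with a made-up title list and a fake search backend that records its calls.
calls = []

def fake_search(query):
    calls.append(query)
    return ["result for " + query]

titles = ["Piano Concerto No. 1 (Tchaikovsky)", "Tchaikovsky"]
print(go("tchaikovsky", titles, fake_search))
print(go("tchaikovsky's piano concerto", titles, fake_search))
print(calls)
```

The point of the design is that the first request never touches the search backend, which is where the reduction in request rate would come from.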
-- Tim Starling
On Thu, Nov 09, 2006 at 02:39:27AM +1100, Tim Starling wrote:
As for search and go, well my preference for a long time has been that we should remove the "search" button from the sidebar, to make it harder to find and thus reduce the request rate. Go is much faster than search, if it manages to find a title match without sending you on to the full text search. Most readers don't know the difference, and if they did, they would click "go" in the great majority of cases.
Second.
All in favor?
Cheers, -- jra
On 08/11/06, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Nov 09, 2006 at 02:39:27AM +1100, Tim Starling wrote:
As for search and go, well my preference for a long time has been that we should remove the "search" button from the sidebar, to make it harder to find and thus reduce the request rate. Go is much faster than search, if it manages to find a title match without sending you on to the full text search. Most readers don't know the difference, and if they did, they would click "go" in the great majority of cases.
Second. All in favor?
Not quite - make it a "go" button, but still *call* it "Search." Because something everyone looks for on coming to a site is where to search.
- d.
On Wed, Nov 08, 2006 at 04:37:56PM +0000, David Gerard wrote:
On 08/11/06, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Nov 09, 2006 at 02:39:27AM +1100, Tim Starling wrote:
As for search and go, well my preference for a long time has been that we should remove the "search" button from the sidebar, to make it harder to find and thus reduce the request rate. Go is much faster than search, if it manages to find a title match without sending you on to the full text search. Most readers don't know the difference, and if they did, they would click "go" in the great majority of cases.
Second. All in favor?
Not quite - make it a "go" button, but still *call* it "Search." Because something everyone looks for on coming to a site is where to search.
Excellent point.
Ratified as amended. :-)
And while I'm here: let me restate my pleasure with how the MediaWiki page renders on my CSS-less Blackberry, and pray that it continues that way...
Cheers, -- jra
On 09/11/06, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Nov 09, 2006 at 02:39:27AM +1100, Tim Starling wrote:
As for search and go, well my preference for a long time has been that we should remove the "search" button from the sidebar, to make it harder to find and thus reduce the request rate. Go is much faster than search, if it manages to find a title match without sending you on to the full text search. Most readers don't know the difference, and if they did, they would click "go" in the great majority of cases.
Second.
All in favor?
I am not sure it would be a good idea for Commons - although I guess most searches would not have a direct match in the main namespace, so maybe the end result would be the same - but I think that Commons is different enough to the other Wikimedia wikis to need a specialised search function. Can we plead a special case? :)
Maybe it would be better to wait for the "image implementation overhaul" to happen, no idea when that might be. But even a gallery for search results could be useful. Some functionality like MediaSearch [1] would be a godsend...
regards, Brianna user:pfctdayelise
[1]: http://tools.wikimedia.de/~daniel/WikiSense/MediaSearch.php?wikifam=commons....
Brianna Laugher wrote:
But even a gallery for search results could be useful. Some functionality like MediaSearch [1] would be a godsend... [1]: http://tools.wikimedia.de/~daniel/WikiSense/MediaSearch.php?wikifam=commons....
Wow, that is a really nice tool. Being able to put in your username and see a table of images in a gallery layout, all nicely tagged with licence information, is great.
Only things I'd like to see changed or added are:
- Some of the "orphan" tags aren't 100% right (I know some of the images listed as orphans aren't).
- I'd like it to allow searching the image file name, and image description text as well, as I have in the past had difficulty finding images when I only had a rough idea of their name.
All the best, Nick.
On 09/11/06, Nick Jenkins nickpj@gmail.com wrote:
Brianna Laugher wrote:
But even a gallery for search results could be useful. Some functionality like MediaSearch [1] would be a godsend... [1]: http://tools.wikimedia.de/~daniel/WikiSense/MediaSearch.php?wikifam=commons....
Wow, that is a really nice tool. Being able to put in your username and see a table of images in a gallery layout, all nicely tagged with licence information, is great.
Only things I'd like to see changed or added are:
- Some of the "orphan" tags aren't 100% right (I know some of the images listed as orphans aren't).
That is only on Commons. Therefore please add any such images to galleries or categories, on Commons. :)
Brianna
As for search and go, well my preference for a long time has been that we should remove the "search" button from the sidebar, to make it harder to find and thus reduce the request rate. Go is much faster than search, if it manages to find a title match without sending you on to the full text search. Most readers don't know the difference, and if they did, they would click "go" in the great majority of cases.
Agreed, most of the time people are searching on article title name, and they don't know the exact name, or they have misspelt it.
Alternatively, keep the "search" function, but before the search results, have a "Did you mean?" section, like a Google search, where the did-you-mean suggestions are based on similarity of search words to known article titles and/or dictionary words. Further, if enough people click a "did you mean" suggestion for the same search term, then that's probably a good hint that we need to add a redirect from that search term to what they actually wanted.
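A minimal sketch of the title-similarity part of that idea, using difflib from Python's standard library. The did_you_mean function, the title list, and the cutoff of 0.6 are assumptions for illustration. A real title index would be far too large to scan linearly like this, so treat it as a demonstration of the matching only.

```python
import difflib

def did_you_mean(query, titles, limit=3, cutoff=0.6):
    """Suggest known titles whose spelling is close to the query."""
    # Compare case-insensitively, but return titles in their original form.
    by_folded = {title.casefold(): title for title in titles}
    close = difflib.get_close_matches(
        query.casefold(), list(by_folded), n=limit, cutoff=cutoff
    )
    return [by_folded[match] for match in close]

# Made-up title list.
titles = ["Pyotr Ilyich Tchaikovsky", "Tchaikovsky", "Piano concerto"]
print(did_you_mean("Tchaikovski", titles))  # prints ['Tchaikovsky']
```

Counting how often each suggestion is clicked for a given query, as proposed above, would be a separate piece of bookkeeping on top of this.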
All the best, Nick.
wikitech-l@lists.wikimedia.org