Steve Bennett wrote:
> On 11/8/06, Tim Starling <tstarling(a)wikimedia.org> wrote:
>> We've now dedicated a server to search index updates. I've got them
>> running in two threads, one for enwiki and one for everything else.
>> Each one runs in a loop; we should get a complete index update once
>> every 30 hours or so.
>
> Would now be a good time to ask questions about what the search index
> algorithm is and possible improvements to it? It works okay, but
> sometimes the bleedingly obvious match isn't even on the first page.
>
> Example: Search for "tchaikovsky's piano concerto" (before I put the
> redirect in). The correct match "Piano Concerto No. 1 (Tchaikovsky)"
> is miles down the page, even though it has the three search terms
> (minus an 's) in the title. Search still seems to miss pages where the
> title is the *exact* search term, too.
>
> "Sounds-like" searches would be nice too - very often you look up
> stuff that you don't know how to spell, but may get the pronunciation
> close.
>
> Accentless searches would be great, too. It gets really tedious making
> redirects like "Spisska Kapitula -> Spišská Kapitula".
>
> And of course, resolving this whole distinction between "searches" and
> "gotos" would be nice :)
Yes, this is a good time to ask about it. I've been thinking about this
myself for the last few days, and I'm glad to hear a different perspective
on it.
The main question in my mind is what to index. Currently, there's a function
in the updater called StripWiki(), which appears to be based on the wikitext
stripping function from the MediaWiki core. It strips out things like
tables, the target half of piped links, a few random HTML tags and lots of
punctuation. I suspect this whole process is entirely unnecessary. I'm
considering removing StripWiki altogether, and just indexing the raw
wikitext. The tokenizer will strip punctuation characters by itself. There's
lots of useful data in tables; I don't see why you would want to stop people
from searching in them. And stripping out everything from [[Image:]] tags
except for the caption seems like a waste of time.
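As a rough illustration of why StripWiki() may be redundant (a minimal Python sketch, standing in for Lucene's actual analyzer, not any real MediaWiki code): a plain word tokenizer already discards table syntax, link brackets and punctuation on its own, leaving cell text and both halves of a piped link searchable. It also shows the flip side: markup words like "class" and "wikitable" become index terms.

```python
import re

def tokenize(text):
    """Minimal word tokenizer: lowercase alphanumeric runs; all
    punctuation and bracket syntax is simply dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

wikitext = ('{| class="wikitable"\n'
            '| [[Piano Concerto No. 1 (Tchaikovsky)|the concerto]]\n'
            '|}')
tokens = tokenize(wikitext)
# Table markup and brackets vanish; the link target, the link label
# and the cell text all survive as searchable terms -- but so do the
# attribute words "class" and "wikitable".
print(tokens)
```

Note that this is exactly the trade-off discussed here: indexing raw wikitext is cheap and keeps table data, at the cost of indexing some syntax noise.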
There are two other benefits from getting rid of StripWiki: firstly, 45% of
index update time is currently spent in StripWiki(). This is partly due to
its inefficient implementation, but I don't want to have to rewrite it if
it's useless. The second benefit is that you can use Lucene to do hit
context highlighting. Currently, we have our own context highlighter in
MediaWiki, which doesn't match up at all well with what Lucene is actually
matching. It doesn't understand phrase searches, for instance.
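To illustrate the mismatch, here is a toy phrase-aware snippet highlighter (a hypothetical sketch, not MediaWiki's or Lucene's actual highlighter): it has to relocate the phrase in the text itself, and it only produces the right snippet if it matches text the same way the engine did. A highlighter driven by Lucene's own match positions avoids that duplication entirely.

```python
import re

def highlight(text, phrase, context=30):
    """Find a phrase match and return a short snippet with the hit
    wrapped in ''...'' quotes, search-result style. Returns None if
    the phrase doesn't occur -- which is exactly the failure mode of
    a highlighter that matches differently from the search engine."""
    m = re.search(re.escape(phrase), text, re.IGNORECASE)
    if m is None:
        return None
    start = max(0, m.start() - context)
    end = min(len(text), m.end() + context)
    snippet = text[start:end]
    return snippet.replace(m.group(0), "''" + m.group(0) + "''")

doc = "The Piano Concerto No. 1 by Tchaikovsky premiered in Boston in 1875."
print(highlight(doc, "piano concerto"))
```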
There are some potential problems, such as trying to find the word "span" in
a document without getting lots of hits from syntax, or phrase searching
across piped links.
Another quite different option is to index the HTML. This is potentially
more expensive, but avoids the complexity of reimplementing half the parser
to strip wikitext. It allows you to search documents with templates
expanded, and markup can easily be stripped out with a stock HTML filter.
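Stripping markup from rendered HTML really is routine. A minimal sketch using Python's stock HTML parser (standing in for whatever filter the Mono/Lucene indexer would actually use): the parser fires callbacks for text nodes, and concatenating them yields clean indexable text with templates already expanded.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML document,
    discarding all tags and attributes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once per text node between tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

html = '<p>See <a href="/wiki/Foo">the <b>Foo</b> article</a>.</p>'
extractor = TextExtractor()
extractor.feed(html)
print(extractor.text())  # See the Foo article.
```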
Accentless search should be easy, assuming Mono has an NFKD implementation
or something similar. I believe fuzzy search is similar, although I'm not as
familiar with the specific technology. Scoring problems (such as your
Tchaikovsky example) and missing hits should be dealt with as bugs, please
report them. Most complaints of missing hits in the past have been due to
slow index updates, so be careful to rule that out before you make a report.
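A minimal sketch of accent folding via NFKD (using Python's unicodedata here, on the assumption that Mono exposes something equivalent): decompose each character, then drop the combining marks, so "Spišská" and "Spisska" produce identical index terms and the redirect becomes unnecessary.

```python
import unicodedata

def fold_accents(s):
    """Decompose to NFKD, then discard combining marks, so accented
    and unaccented spellings normalize to the same string."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_accents("Spišská Kapitula"))  # Spisska Kapitula
```

Applied at both index time and query time, this makes accentless search transparent to the user.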
As for search and go, well, my preference for a long time has been that we
should remove the "search" button from the sidebar, to make it harder to
find and thus reduce the request rate. Go is much faster than search, if it
manages to find a title match without sending you on to the full text
search. Most readers don't know the difference, and if they did, they would
click "go" in the great majority of cases.
-- Tim Starling