Hi, Just wondering what the story with the search functionality is exactly. My biggest concern is that searches for the exact title of an existing (but "recently" created) article fail.
For example: http://en.wikipedia.org/wiki/Special:Search?search=Cities+%28album%29&fu...
I understand that searching is a difficult problem best left to Google, but is there some way it could at least be hacked to check for articles whose name matches the search string verbatim?
Thanks, Steve
Steve Bennett wrote:
Hi, Just wondering what the story with the search functionality is exactly.
Currently the index does not update on a regular basis as it caused the search servers to explode violently, bringing everything down. Until that is corrected, search is only updated sporadically.
I understand that searching is a difficult problem best left to Google, but is there some way it could at least be hacked to check for articles whose name matches the search string verbatim?
Click the "go" button or hit "enter" when searching.
-- brion vibber (brion @ pobox.com)
On 4/13/06, Brion Vibber brion@pobox.com wrote:
Currently the index does not update on a regular basis as it caused the search servers to explode violently, bringing everything down. Until that is corrected, search is only updated sporadically.
Ok, when could that next be done? Is there any kind of plan to fix searching at some stage? It's not like it's some minor obscure little feature...
Click the "go" button or hit "enter" when searching.
Yes, *I* know that...
Steve
Steve Bennett wrote:
On 4/13/06, Brion Vibber brion@pobox.com wrote:
Currently the index does not update on a regular basis as it caused the search servers to explode violently, bringing everything down. Until that is corrected, search is only updated sporadically.
Ok, when could that next be done? Is there any kind of plan to fix searching at some stage? It's not like it's some minor obscure little feature...
Currently it works off the page data dumps.
-- brion vibber (brion @ pobox.com)
On 13/04/06, Brion Vibber brion@pobox.com wrote:
I understand that searching is a difficult problem best left to Google, but is there some way it could at least be hacked to check for articles whose name matches the search string verbatim?
Click the "go" button or hit "enter" when searching.
I noticed a related problem today. The Wikipedia search box in Firefox performs a "search", not a "go". Hence if you search for a term that has been defined recently, you get nuttin'.
How about this: On the search results screen, simply provide a link to the term, behaving differently if the term actually has an article behind it. Currently, you see this:
From Wikipedia, the free encyclopedia
You searched for "aouoeau" [Index]
However, the colour of the link is identical whether aouoeau exists or not. Perhaps it could be modified as follows:
From Wikipedia, the free encyclopedia
No page with that title exists. You can create this article or request it.
Results 1-20...
or alternatively
From Wikipedia, the free encyclopedia
The page <a href...>Swiss National Museum</a> exists.
Results 1-20...
----
That way you get the best of both worlds: the search index can be left broken, and the functionality of "go" and "search" is combined, with a minimum of effort.
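The check behind that link could be a single indexed lookup against the page table (a sketch using the MediaWiki schema; the title value is just an example):

SELECT page_id
FROM page
WHERE page_namespace = 0   -- main article namespace
  AND page_title = 'Swiss_National_Museum'
LIMIT 1;

One row back means "The page ... exists." rendered as a normal blue link; no rows means the red "create this article" line.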
What say you?
Steve
Steve Bennett wrote:
Just wondering what the story with the search functionality is exactly. My biggest concern is that searches for the exact title of an existing (but "recently" created) article fail.
Search engines don't update their search index live with every new item. The problem with Wikipedia is its size and its rapid rate of change. Normally you would generate a new index every week or every night - and generating a search index for millions of records takes hours! A powerful MediaWiki search engine with a time lag of 1 to 2 days would also be fine for me. You could also imagine a smart search engine that works on an old dump in a first pass and checks the live database in a second pass.
This is what I wrote about it last year: http://wm.sieheauch.de/?p=4 With this sketch of a special search engine for MediaWiki: http://wm.sieheauch.de/files/MediaWikiSearchEngine.html
To get such a powerful search it's better to build it from scratch as an independent application, instead of coding it into MediaWiki, so you can optimize for searching only (but I'm no MediaWiki developer, so I may be wrong).
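A rough sketch of that two-pass idea against MediaWiki's own tables (searchindex holds the full-text index; the second query is only my guess at how newly created pages could be caught, not existing code):

-- First pass: the full-text index, built from the last dump (stale):
SELECT si_page FROM searchindex
WHERE MATCH(si_title) AGAINST('cities album');

-- Second pass: pages created since that dump, from the live database:
SELECT rc_cur_id FROM recentchanges
WHERE rc_new = 1 AND rc_namespace = 0 AND rc_title = 'Cities_(album)';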
I understand that searching is a difficult problem best left to Google, but is there some way it could at least be hacked to check for articles whose name matches the search string verbatim?
How about a title search?
SELECT page_id FROM page WHERE page_title RLIKE $regexp AND $conditions LIMIT $limit
It would be useful to find articles named "FOO (BAR)", "List of FOO" etc.
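For example, with the placeholders filled in (hypothetical values):

SELECT page_id FROM page
WHERE page_title RLIKE '^List_of_' AND page_namespace = 0
LIMIT 50;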
Greetings, Jakob
On 4/13/06, Jakob Voss jakob.voss@nichtich.de wrote: [snip]
How about a title search?
SELECT page_id FROM page WHERE page_title RLIKE $regexp AND $conditions LIMIT $limit
It would be useful to find articles named "FOO (BAR)", "List of FOO" etc.
narf.
Seqscanning through a million rows on each user's query? I think not.
If MySQL's full-text indexing worked on anything but MyISAM tables it would be easy to provide your title search, but it doesn't.
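To make the cost concrete (a sketch, with hypothetical patterns): only a pattern anchored to a fixed prefix can use the index on page_title; an arbitrary regular expression can't, so every row gets scanned.

-- Can use the title index (fixed prefix; \_ matches a literal underscore):
SELECT page_id FROM page WHERE page_title LIKE 'List\_of\_%' LIMIT 50;

-- Can't use any index; forces a sequential scan of the whole table:
SELECT page_id FROM page WHERE page_title RLIKE '\\(album\\)$' LIMIT 50;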
On 13/04/06, Jakob Voss jakob.voss@nichtich.de wrote:
Search engines don't update their search index live with every new item. The problem with Wikipedia is its size and its rapid rate of change. Normally you would generate a new index every week or every night - and generating a search index for millions of records takes hours! A powerful MediaWiki search engine with a time lag of 1 to 2 days would also be fine for me. You could also imagine a smart search engine that works on an old dump in a first pass and checks the live database in a second pass.
That would be more than fine. I gather the search db is currently several months out of date? But that wasn't my major complaint.
To get such a powerful search it's better to build it from scratch as an independent application, instead of coding it into MediaWiki, so you can optimize for searching only (but I'm no MediaWiki developer, so I may be wrong).
Well, it should be as easily accessible as the search box is now.
SELECT page_id FROM page WHERE page_title RLIKE $regexp AND $conditions LIMIT $limit
That would be nice, but even the simple mechanism of exact matches would be a start. And then you could add fallbacks, like all upper case, all lower case, upper case for the first letter of each word, and so on, if performance is the issue here.
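Something like this, say (just a sketch; the titles are made-up examples, and each variant is still a cheap indexed point lookup rather than a scan):

-- Try the exact title first:
SELECT page_id FROM page
WHERE page_namespace = 0 AND page_title = 'Cities_(album)';

-- If that finds nothing, fall back to a few case variants:
SELECT page_id FROM page
WHERE page_namespace = 0
AND page_title IN ('CITIES_(ALBUM)', 'cities_(album)', 'Cities_(Album)');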
Steve
On 4/13/06, Steve Bennett stevage@gmail.com wrote:
That would be nice, but even the simple mechanism of exact matches would be a start. And then you could add fallbacks, like all upper case, all lower case, upper case for the first letter of each word, and so on, if performance is the issue here.
All upper/lower isn't really effective on Wikipedia because most of our multiword titles are mixed case.
AFAIR, most string matches in MySQL are case-insensitive, which would mean that we could have indexed case-insensitive matches quickly... but I'm guessing that our use of binary fields for titles (which is required because no version of MySQL has complete UTF-8 support) most likely breaks that.
Alas, MySQL doesn't have functional indexes (http://bugs.mysql.com/bug.php?id=4990), or it would be fairly trivial to offer a fast case-insensitive match.
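One workaround (a sketch, not anything in the current schema) would be to keep a separate lowercased copy of the title with its own index, which is what a functional index would otherwise give us:

-- Hypothetical shadow column standing in for a functional index:
ALTER TABLE page ADD COLUMN page_title_lower VARBINARY(255);
CREATE INDEX page_title_lower ON page (page_title_lower);

-- The column has to be filled in by the application on save, since
-- LOWER() does nothing useful on a binary column.

-- Then an exact case-insensitive title lookup is a fast indexed read:
SELECT page_id FROM page
WHERE page_namespace = 0 AND page_title_lower = 'cities_(album)';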
Jakob Voss wrote:
Search engines don't update their search index live with every new item. The problem with Wikipedia is its size and its rapid rate of change. Normally you would generate a new index every week or every night - and generating a search index for millions of records takes hours!
I'm not sure if you're talking about the big web search engines (Google, Yahoo, MSN) or the search function in MediaWiki here. There is little excuse for the latter to have any delay. But even for a big web search engine, it is easy to keep track of how often each webpage has changed in the past, and economize on how often it needs to be revisited. Combined with the high PageRank of en.wp's RecentChanges (9 of 10), it would be trivial for Googlebot to revisit this page (or the front pages of major newspaper websites) every minute or two and make it a high priority to reindex all pages linked from there. I suppose this is how Google News works. Why it still takes about a month for Google to update its index on Wikipedia articles is a mystery to me. Probably it has to do with a lack of competition. If MSN or Yahoo were faster, it would force Google to improve.
On 4/14/06, Lars Aronsson lars@aronsson.se wrote:
I'm not sure if you're talking about the big web search engines (Google, Yahoo, MSN) or the search function in MediaWiki here. There is little excuse for the latter to have any delay. But even for a big web search engine, it is easy to keep track of how often each webpage has changed in the past, and economize on how often it needs to be revisited. Combined with the high PageRank of en.wp's RecentChanges (9 of 10), it would be trivial for Googlebot to revisit this page (or the front pages of major newspaper websites) every minute or two and make it a high priority to reindex all pages linked from there. I suppose this is how Google News works. Why it still takes about a month for Google to update its index on Wikipedia articles is a mystery to me. Probably it has to do with a lack of competition. If MSN or Yahoo were faster, it would force Google to improve.
I know this was intentionally provocative, but I'll bite anyway.
As far as I know, the general limitation on Google indexing more of Wikipedia is that Wikipedia can't serve pages fast enough (or, more accurately, the extra load of more Googlebot traffic would make Wikipedia slower).
To answer your specific proposal:
1) http://en.wikipedia.org/wiki/Special:Recentchanges has a meta tag: <meta name="robots" content="noindex,follow" /> which indicates it's explicitly disallowed from being crawled.
2) If it were allowed to be crawled, I'd expect it to be regularly updated for the reasons you describe. But even in that case, this particular page changing rapidly is not an indicator that the target pages are also changing rapidly. For example, I imagine that the digg front page changes pretty much every time a crawler visits, but the pages linked *from* digg are not necessarily changing any more rapidly than any other random page on the web.
Instead, there is a way for webmasters and Google to cooperate: the sitemaps program. You can read more about it here: https://www.google.com/webmasters/sitemaps/docs/en/about.html
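For reference, a sitemap is just a small XML file listing URLs with freshness hints, roughly like this (details per Google's sitemap docs; treat the specifics as illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://en.wikipedia.org/wiki/Cities_%28album%29</loc>
    <lastmod>2006-04-13</lastmod>
    <changefreq>hourly</changefreq>
  </url>
</urlset>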
On April 14, Evan Martin wrote:
To answer your specific proposal:
- http://en.wikipedia.org/wiki/Special:Recentchanges has a meta tag:
<meta name="robots" content="noindex,follow" /> which indicates it's explicitly disallowed from being crawled.
As far as I understand the robots meta tag, "noindex,follow" tells robots that they are welcome to fetch the page, that they can find links to other pages here (= follow), but that they should never show this page among the search hits (= noindex).
Words such as crawl and index are somewhat fuzzy here. Does "index" mean fetch or does it mean store in an index, to be returned to users as a search hit? I found no clear answer. Of course, the crawler/robot/spider is already fetching the page when it sees the meta tag. And it must fetch the page again to see if the meta tag has changed.
The Pipermail software that is used for the wikitech-l archive sets "noindex,follow" for the overview sorted by date, e.g. http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/date.html but for the individual posting, it sets "index,nofollow", e.g. http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034969.html
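Side by side, the two Pipermail settings:

<!-- date overview: fetch it and follow its links, but don't list it as a search hit -->
<meta name="robots" content="noindex,follow" />

<!-- individual posting: list it as a search hit, but don't bother following its links -->
<meta name="robots" content="index,nofollow" />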
I believe that "noindex,follow" is used for many "sitemap" pages, and this is my idea of how search robots should use RecentChanges.
Indeed, the front page of any newspaper website is also similar to a sitemap. Its content changes so often that it becomes useless to index it under any specific word found there. If people search for "hurricane katrina", they don't want the front page of the Washington Post, which will have changed by the time they arrive. But they might be interested in the news article about this topic, and the front page was the way to harvest the link to that article.
The main difference, then, between the newspaper and Wikipedia is that the newspaper uses its RecentChanges as its front page. Plus the fact that Wikipedia isn't covered by Google News.