Faster! Thanks! - Wikitech-l - lists.wikimedia.org


Faster! Thanks!


Larry Sanger

13 Feb 2002

7:42 p.m.

Just as the announcements page says, Wikipedia is a lot faster now. Thanks to everyone who made it possible. A quick-loading Recent Changes page is a beautiful thing. Larry


Jan Hidders

13 Feb

11:24 p.m.

New subject: update: new MySQL search index committed to CVS

L.S.,

I have just committed a rewritten version of the search engine to CVS. It is now based on the fulltext indexes offered by MySQL. To use it, the database schema has to be extended a little: the table types have changed to MyISAM (MySQL's own version of ISAM), there is an extra redundant column, and two fulltext indexes have to be added. The commands to do this have been added to "updSchema.sql"; just uncomment the commands you still have to run. The resulting database schema should be as in "wikipedia.sql".

This new search engine is something of a mixed blessing. First the good news:

+ The main reason for introducing it is that the database reported the search queries as the slowest ones, so they are eating up a lot of database resources. So not only will it make searches faster, but the whole of Wikipedia will probably benefit from its introduction.
+ The search engine tries to estimate the relevance of pages with respect to the given search words, and the results are presented to the user in that order, not sorted alphabetically as they are now.
+ With the arrival of MySQL 4 there will be new search possibilities such as boolean searches and natural language searches.

And now for the bad news:

- I had to introduce an extra redundant column containing a duplicate of cur_title. This is because the fulltext index cannot be defined on binary columns.
- If you give the search engine multiple words, it will also regard multiple occurrences of just one of those words as very relevant. So the term "Larry Sanger" will also lead you to pages that only contain a lot of "Larry".
- It does not search on small words of three letters or fewer. So the term "war" gives you zero results.
- It searches in the raw HTML, so it doesn't know that G&ouml;del and Gödel are the same.

That's it for now. My next task will be the MostWanted page.

-- Jan Hidders
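The two ranking caveats in the bad-news list (repeated occurrences of one word, and ignored short words) can be illustrated with a toy model. This is only a rough Python sketch and not MySQL's actual fulltext ranking; the names `MIN_WORD_LEN`, `tokenize`, `toy_score`, and `search`, as well as the sample pages, are hypothetical and chosen purely for illustration.

```python
import re

# Assumption for this sketch: words of three letters or fewer are never indexed.
MIN_WORD_LEN = 4

def tokenize(text):
    """Lowercase the text and keep only words long enough to be indexed."""
    return [w for w in re.findall(r"\w+", text.lower()) if len(w) >= MIN_WORD_LEN]

def toy_score(query, page_text):
    """Count every occurrence of any indexed query word in the page."""
    terms = set(tokenize(query))
    return sum(1 for w in tokenize(page_text) if w in terms)

# Hypothetical sample pages (title -> text).
pages = {
    "Larry Sanger": "Larry Sanger co-founded Wikipedia.",
    "Larry (disambiguation)": "Larry Bird, Larry King, Larry Page, Larry Wall.",
    "War": "A war is an armed conflict.",
}

def search(query):
    """Return the titles of matching pages, highest toy score first."""
    scored = [(toy_score(query, text), title) for title, text in pages.items()]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]
```

In this toy model, `search("Larry Sanger")` ranks the page that repeats "Larry" four times above the page that contains both words once, and `search("war")` returns nothing because the query word is dropped before matching.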


Jimmy Wales

11:30 p.m.

New subject: update: new MySQL search index committed to CVS

One cute trick that I have often used is to calculate the "uselessness" of a particular word. A word is semantically more useless if it appears more often. This has really dramatic empirical results for the better, especially on small datasets. (Maybe on really big ones, too, but I've never played with those.) Thus if someone searches for 'John Malkovich' they get a good result, because 'John' is not weighted so heavily -- it's a more useless word because it appears more often in the search set. But 'Malkovich', now you're talking, there's a word that _means something_.
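One way to read the "uselessness" trick is as inverse document frequency weighting: a word counts for less the more pages it appears in. The following is a minimal Python sketch under that assumption, not the code actually used; the function names, the `log(N / df)` weighting formula, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter

def words(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())

def usefulness_weights(documents):
    """Weight each word by log(N / df), where df is the number of documents
    containing it: the more common (useless) the word, the lower the weight."""
    n = len(documents)
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(words(text)))
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def rank(query, documents):
    """Rank documents by the summed weights of the query words they contain."""
    weights = usefulness_weights(documents)
    scores = {}
    for title, text in documents.items():
        present = set(words(text))
        score = sum(weights[w] for w in set(words(query)) if w in present)
        if score > 0:
            scores[title] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample documents (title -> text).
docs = {
    "Being John Malkovich": "John Malkovich plays himself in this film",
    "John Smith": "John Smith was an English explorer",
    "John Milton": "John Milton wrote Paradise Lost",
    "Film": "A film is a series of images",
}
```

With these sample documents, 'john' appears in three of four texts and gets a low weight, while 'malkovich' appears in only one and gets a high weight, so `rank("John Malkovich", docs)` puts "Being John Malkovich" first.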
