Re: [Mediawiki-l] Best full-text search engine?

6 Dec 2007


I would like to count myself as someone who knows what they are talking
about, but that might be a stretch!
Anyhow, we implemented Lucene on a wiki with approaching 350k pages. The
performance and types of search available are impressive and we are using
the existing search tables in MySQL to feed the Lucene index. We perform an
incremental synchronization every 15 minutes, and the indexing is fast, with a
couple of thousand documents being indexed per minute. It is unlikely we
will have that many changed pages in any 15-minute period, so all should be fine.
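For anyone wondering what such an incremental synchronization could look like, here is a minimal sketch in Python. It is not our actual setup: sqlite3 stands in for MySQL, a plain dict stands in for the Lucene index, and the table name search_rows and its columns (id, title, body, touched) are hypothetical names chosen for illustration. The idea is only to pick up rows changed since the last run and remember a high-water mark for the next one.

```python
import sqlite3


def sync_incremental(conn, index, last_sync):
    """Copy rows changed since last_sync into index.

    Returns (new_high_water_mark, number_of_rows_indexed).
    The table and column names are hypothetical, not MediaWiki's schema.
    """
    rows = conn.execute(
        "SELECT id, title, body, touched FROM search_rows "
        "WHERE touched > ? ORDER BY touched",
        (last_sync,),
    ).fetchall()
    newest = last_sync
    for page_id, title, body, touched in rows:
        # A real setup would hand the document to the search engine here.
        index[page_id] = {"title": title, "body": body}
        newest = max(newest, touched)
    return newest, len(rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE search_rows "
        "(id INTEGER PRIMARY KEY, title TEXT, body TEXT, touched INTEGER)"
    )
    conn.executemany(
        "INSERT INTO search_rows VALUES (?, ?, ?, ?)",
        [(1, "Main Page", "welcome", 100), (2, "Fishing", "fish and fishes", 200)],
    )
    index = {}
    # First run: everything is newer than mark 0, so both rows are indexed.
    first = sync_incremental(conn, index, 0)
    # One page is edited; the next run should pick up only that page.
    conn.execute(
        "UPDATE search_rows SET body = ?, touched = ? WHERE id = ?",
        ("welcome back", 300, 1),
    )
    second = sync_incremental(conn, index, first[0])
    return index, first, second
```

In practice the function would be run from a scheduler (cron or similar) every 15 minutes, with the returned mark stored between runs. Deleted pages are not handled in this sketch.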
We get things like language stemming (a search for "fishing" also returns
"fish" and "fishes"), multiple language support, prioritization based upon
various factors (title, occurrences of words near each other) and a host of
other features.
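To illustrate the stemming idea, here is a toy sketch in Python. It is deliberately crude and is not how Lucene's analyzers work internally; the suffix list and the function names toy_stem, build_index and search are made up for this example. It only shows why reducing words to a common stem at both index time and query time lets "fishing" match "fish" and "fishes".

```python
# Toy suffix list, checked in order; real stemmers use far more careful rules.
SUFFIXES = ("ing", "es", "s")


def toy_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way the documents were stemmed."""
    return sorted(index.get(toy_stem(query), set()))


docs = {1: "fish", 2: "fishes", 3: "fishing trip", 4: "wiki"}
index = build_index(docs)
matches = search(index, "fishing")  # documents 1, 2 and 3 share the stem "fish"
```

The important design point is that the same normalization is applied to documents and queries; otherwise the stems would never line up.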
Memory is indeed a consideration but even so we are able to run this on a
1 GB dedicated server and still see search response times well under 200 ms.
We will be moving to a bigger server but not as a result of Lucene.
If you look at Lucene then also look to implement Solr which provides added
functionality as well as the standard highlight search term within the
teaser results and transaction management amongst multiple index servers.
If you want to talk to the best in the industry go to sematext
(www.sematext.com) which is run by Otis, one of the original participants in
Lucene. We are VERY happy with our move to Lucene and will be adding Solr in
the next couple of weeks.
Hope some of this helps.
Regards,
Paul
On 12/5/07 3:35 PM, "Jim Hu" jimhu@tamu.edu wrote:
...
I've been thinking about moving from the default to Lucene, and am NOT
an expert, so take the following with lots of NaCl.  I'd like to hear
what people who know what they're talking about think!
As I understand it, Lucene indexes and stores the indexes into a set
of index files that are kept in memory or are swapped in as needed and
does not use the backend database that's running the wiki.  By
contrast, Sphinx works via mySQL.  I believe that this difference can
be important as the size and use of the wiki increases, since the
search can end up taxing the db leading to performance degradation for
mySQL.  But if Lucene sucks up all your free memory, you could get
performance problems outside mySQL.  This is probably not an issue for
your setup behind a firewall, but I'm wondering how to think about the
tradeoffs for a smallish single-server wiki that sometimes gets
swamped by search engine hits. And yes, I know that I need to learn
more about robots.txt too...
Google also sells search appliances, in case you really want it to
be exactly like Google.  ;)
Jim
On Dec 5, 2007, at 1:57 PM, Emufarmers Sangly wrote:
...
On Dec 5, 2007 10:23 AM, Jonathan Nowacki jnowacki@gmail.com wrote:
...
I have a MediaWiki-based resource that needs a full-text search engine.
Google will not work as it is not yet a public resource.  Anyone have any
recommendations?  This is intended to be used at an academic institution.
Lucene <http://www.mediawiki.org/wiki/Extension:LuceneSearch> is what
Wikipedia uses.  You might also want to take a look at Sphinx
<http://www.mediawiki.org/wiki/Extension:SphinxSearch>.
-- 
Arr, ye emus, http://emufarmers.com
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/mediawiki-l
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054

