I would like to include myself as someone who knows what they are talking about but it might be a stretch!
Anyhow, we implemented Lucene on a wiki approaching 350k pages. The performance and the types of search available are impressive, and we are using the existing search tables in MySQL to feed the Lucene index. We perform an incremental synchronization every 15 minutes, and the indexing is fast, with a couple of thousand documents being indexed per minute. It is unlikely we will have that many changes in a 15-minute period, so all should be fine.
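For anyone wondering what an incremental synchronization of this kind looks like, here is a minimal sketch of the general idea, not our actual code. It assumes a hypothetical table `search_rows` with a `touched` timestamp column, uses an in-memory SQLite database as a stand-in for MySQL, and uses a plain dict as a stand-in for the Lucene index. Each pass only picks up rows changed since the previous pass.

```python
import sqlite3


def sync_changed_rows(conn, index, last_sync):
    """Copy rows touched after last_sync into index.

    Returns (new_watermark, number_of_rows_indexed).
    The table and column names are hypothetical, for illustration only.
    """
    rows = conn.execute(
        "SELECT page_id, title, body, touched FROM search_rows "
        "WHERE touched > ? ORDER BY touched",
        (last_sync,),
    ).fetchall()
    for page_id, title, body, touched in rows:
        # Re-adding an existing page_id overwrites the old entry.
        index[page_id] = {"title": title, "body": body}
        last_sync = max(last_sync, touched)
    return last_sync, len(rows)


# Stand-in database: SQLite in memory instead of MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_rows ("
    "page_id INTEGER PRIMARY KEY, title TEXT, body TEXT, touched INTEGER)"
)
conn.executemany(
    "INSERT INTO search_rows VALUES (?, ?, ?, ?)",
    [(1, "Fish", "fish and fishes", 100), (2, "Boats", "fishing boats", 200)],
)

index = {}  # stand-in for the Lucene index

# First pass indexes everything newer than timestamp 0.
watermark, first_count = sync_changed_rows(conn, index, 0)

# One page is edited; the next pass only touches that page.
conn.execute(
    "UPDATE search_rows SET body = ?, touched = ? WHERE page_id = ?",
    ("fishing trawlers", 300, 2),
)
watermark, second_count = sync_changed_rows(conn, index, watermark)
```

In a real setup the pass would be run from a scheduler every 15 minutes and the watermark would be stored somewhere persistent between runs.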
You get things like language stemming (a search for "fishing" also returns "fish" and "fishes"), multiple language support, prioritization based upon various factors (title, occurrences of words near each other) and a host of other features.
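To illustrate the stemming idea only, here is a toy sketch. It is a naive suffix stripper I made up for this example and is not how Lucene's analyzers work internally; the function names and the sample documents are hypothetical. The point is just that the query and the indexed words are reduced to the same stem before they are compared.

```python
def naive_stem(word):
    """Strip a few common English suffixes. A toy, not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(docs, query):
    """Return ids of documents containing a word with the same stem as query."""
    target = naive_stem(query)
    return [
        doc_id
        for doc_id, text in docs.items()
        if any(naive_stem(token) == target for token in text.split())
    ]


docs = {1: "fresh fish", 2: "many fishes", 3: "boats"}
hits = search(docs, "fishing")  # matches documents 1 and 2
```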
Memory is indeed a consideration, but even so we are able to run this on a 1GB dedicated server and still see search response times well under 200 ms. We will be moving to a bigger server, but not as a result of Lucene.
If you look at Lucene, then also look at implementing Solr, which provides added functionality on top of the standard features, such as highlighting the search term within the teaser results and transaction management among multiple index servers.
If you want to talk to the best in the industry go to sematext (www.sematext.com) which is run by Otis, one of the original participants in Lucene. We are VERY happy with our move to Lucene and will be adding Solr in the next couple of weeks.
Hope some of this helps.
Regards, Paul
On 12/5/07 3:35 PM, "Jim Hu" jimhu@tamu.edu wrote:
I've been thinking about moving from the default to Lucene, and am NOT an expert, so take the following with lots of NaCl. I'd like to hear what people who know what they're talking about think!
As I understand it, Lucene builds its index and stores it in a set of index files that are kept in memory or swapped in as needed, and it does not use the backend database that's running the wiki. By contrast, Sphinx works via MySQL. I believe that this difference can be important as the size and use of the wiki increases, since search can end up taxing the db, leading to performance degradation for MySQL. But if Lucene sucks up all your free memory, you could get performance problems outside MySQL. This is probably not an issue for your setup behind a firewall, but I'm wondering how to think about the tradeoffs for a smallish single-server wiki that sometimes gets swamped by search engine hits. And yes, I know that I need to learn more about robots.txt too...
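On the robots.txt point, a small sketch of how such rules can be checked with Python's standard `urllib.robotparser`. The rules and URLs below are assumptions for illustration: they suppose a wiki that serves articles under `/wiki/` and dynamic pages (including search) through `/index.php`, which may not match a given installation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers off dynamic pages, leave articles open.
rules = [
    "User-agent: *",
    "Disallow: /index.php",
]

parser = RobotFileParser()
parser.parse(rules)

# example.org and both paths are placeholders, not a real wiki.
search_url = "http://example.org/index.php?title=Special:Search&search=lucene"
article_url = "http://example.org/wiki/Main_Page"

search_allowed = parser.can_fetch("*", search_url)
article_allowed = parser.can_fetch("*", article_url)
```

With rules like these, well-behaved crawlers would skip the search pages while still indexing the articles. Crawlers that ignore robots.txt are not affected.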
Google also sells search appliances, in case you really want it to be exactly like Google. ;)
Jim
On Dec 5, 2007, at 1:57 PM, Emufarmers Sangly wrote:
On Dec 5, 2007 10:23 AM, Jonathan Nowacki jnowacki@gmail.com wrote:
I have a MediaWiki-based resource that needs a full text search engine. Google will not work as it is not yet a public resource. Anyone have any recommendations? This is intended to be used at an academic institution.
Lucene <http://www.mediawiki.org/wiki/Extension:LuceneSearch> is what Wikipedia uses. You might also want to take a look at Sphinx <http://www.mediawiki.org/wiki/Extension:SphinxSearch>.
-- Arr, ye emus, http://emufarmers.com _______________________________________________ MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/mediawiki-l
===================================== Jim Hu Associate Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054