On Feb 1, 2008 9:16 PM, Dan Thomas geoobject@gmail.com wrote:
Brion -- have you considered using SOLR, which extends Lucene? An enterprise-class search engine, v1.3 is nearing release and in addition to XML and text, supports search inside rich documents including MS Office and PDF.
SOLR is a great wrapper around Lucene, but I believe its focus is different from what we need. My impression is that its main goal is to provide an easy and powerful interface to what Lucene already does, with enhancements relevant to enterprise applications (e.g. a flexible schema structure). It addresses almost none of the issues we are having:

1) Solr doesn't support distributed searching and split indexes (this is being worked on, afaik). This is crucial, since our indexes are just too big to fit on a single host.
2) There is no advanced scoring scheme, for instance one using backlinks; it offers the same scoring as Lucene.
3) The default spellchecker is the one Lucene uses, i.e. with per-word suggestions. It works fine on small data sets but gives pretty bad suggestions on large ones.
4) It uses the highlighter that splits text into equal-size chunks without trying to respect sentence boundaries, and it doesn't support highlighting matching phrases. I'm also not sure how efficient its text storage is, given that we have a huge amount of text.
5) No integrated prefix search for AJAX suggestions.
6) No parser for wiki syntax (although there is a wiki parser being developed for Lucene).
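To illustrate the highlighting point, here is a minimal Python sketch contrasting equal-size chunking with sentence-aware snippets. It is not Lucene or Solr code: the function names, the naive regex sentence splitter, and the sample text are all assumptions made up for this example.

```python
import re


def fixed_size_snippets(text, size):
    """Cut text into equal-size chunks, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_snippets(text, query):
    """Return whole sentences that contain the query (case-insensitive).

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Real text (abbreviations, wiki markup) needs a smarter splitter.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = query.lower()
    return [s for s in sentences if needle in s.lower()]


# Hypothetical sample text, for illustration only.
text = ("Lucene is a search library. "
        "Solr wraps Lucene for enterprise use. "
        "Wikis need more.")

chunks = fixed_size_snippets(text, 20)      # first chunk ends mid-word
hits = sentence_snippets(text, "solr")      # whole matching sentence
```

The fixed-size version can cut a snippet in the middle of a word or sentence, while the sentence-aware version returns readable units at the cost of variable snippet length.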
On Feb 1, 2008 2:20 PM, Brion Vibber brion@wikimedia.org wrote:
Instead, new front-end code should be in the Special:Search front-end in core, with a back-end plugin to talk to the Lucene server (the MWSearch extension, possibly a bit out of date.)
<nod>
r.