rainman@svn.wikimedia.org wrote:
Revision: 30390 Author: rainman Date: 2008-02-01 13:17:38 +0000 (Fri, 01 Feb 2008)
Log Message:
A new branch for LuceneSearch extension for the new daemon: will add ajax search and make some minor interface improvements.
Just a note -- I would recommend strongly against doing continued development on the old LuceneSearch front-end extension, as it's a maintenance nightmare.
Instead, new front-end code should be in the Special:Search front-end in core, with a back-end plugin to talk to the Lucene server (the MWSearch extension, possibly a bit out of date.)
-- brion vibber (brion @ wikimedia.org)
Brion -- have you considered using SOLR, which extends Lucene? It is an enterprise-class search engine; v1.3 is nearing release and, in addition to XML and text, it supports search inside rich documents including MS Office and PDF.
http://lucene.apache.org/solr/
Dan
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Feb 1, 2008 9:16 PM, Dan Thomas geoobject@gmail.com wrote:
Brion -- have you considered using SOLR, which extends Lucene? An enterprise-class search engine, v1.3 is nearing release and in addition to XML and text, supports search inside rich documents including MS Office and PDF.
SOLR is a great wrapper around Lucene, but I believe its focus is different from what we need. My impression is that its main goal is to provide an easy and powerful interface to what Lucene already does, with enhancements relevant to enterprise applications (e.g. a flexible schema structure). It addresses almost none of the issues we are having:

1) Solr doesn't support distributed searching and split indexes (this is being worked on, afaik). This is crucial, since our indexes are just too big to fit on a single host.
2) There is no advanced scoring scheme, for instance one using backlinks; it offers the same scoring as Lucene.
3) The default spellchecker is the one Lucene uses, i.e. with per-word suggestions. It works fine on small data sets but gives pretty bad suggestions on large ones.
4) It uses the highlighting that splits text into equal-size chunks without trying to look at sentence boundaries, and it doesn't support highlighting matching phrases. I'm also not sure how efficient its text storage is, since we have a huge amount of text.
5) No integrated prefix search for ajax suggestions.
6) No parser for wiki syntax (although there is a wiki parser being developed for Lucene).
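To make point 5 concrete, here is a minimal sketch of what a prefix search for ajax suggestions could look like. It is only an illustration under stated assumptions: the function name `prefix_search`, the `limit` parameter, and the sample titles are all hypothetical, and the approach (binary search over a sorted, lowercased list of titles held in memory) is not how the Lucene daemon or Solr actually does it.

```python
import bisect


def prefix_search(sorted_titles, prefix, limit=10):
    """Return up to `limit` titles that start with `prefix`.

    Assumes `sorted_titles` is a sorted list of lowercased titles
    (a hypothetical in-memory stand-in for a real title index).
    """
    key = prefix.lower()
    # Binary search for the first title that is >= the prefix.
    start = bisect.bisect_left(sorted_titles, key)
    results = []
    # Matching titles are contiguous in sorted order, so stop at the first miss.
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(key):
            break
        results.append(title)
    return results


# Hypothetical sample data, lowercased and sorted once up front.
titles = sorted(
    t.lower() for t in ["Lucene", "Lucerne", "Luck", "Solr", "Search engine"]
)
print(prefix_search(titles, "luc"))
```

Because the titles are sorted, each lookup costs one binary search plus at most `limit` comparisons, which is the property that makes per-keystroke suggestions feasible; a real index would additionally need ranking and would not fit in a single Python list.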
On Feb 1, 2008 2:20 PM, Brion Vibber brion@wikimedia.org wrote:
Instead, new front-end code should be in the Special:Search front-end in core, with a back-end plugin to talk to the Lucene server (the MWSearch extension, possibly a bit out of date.)
<nod>
r.
In my interpretation, Solr is about building the plumbing and features necessary for a Google-like user experience with open source search. Judging from the Solr community dev/user lists, the effort to 'enterprise' the Lucene library is significant. To your point, Robert: as I understand it, Solr does offer clustering and topology options (and experience) for distributed search and split indexes that are not available in Lucene alone. The Solr API, schemas, facets and similar features preserve or add structure to wiki content and enable customization that produces better match results and opens the door to rich mining, analysis and relationship discovery.
Since Lucene is still fully available with Solr, this raises two questions: 1) whether the additional functionality Solr provides makes it worthwhile to integrate at that level rather than at the Lucene library level, and 2) whether search is important and different enough from the wiki mission to coordinate rather than duplicate.