In my interpretation, Solr is about building the plumbing and features
necessary for a Google-like user experience with open source search.
From following the Solr community dev/user lists, the effort that goes
into making the Lucene library enterprise-ready is significant. To your
point below, Robert: as I understand it, Solr does offer clustering and
topology options (and experience) for distributed search and split
indexes that are not available in Lucene alone. The Solr API, schemas,
facets and similar features preserve and add structure to wiki content,
and enable customization that produces better match results and opens
the door to rich mining, analysis and relationship discovery.
Since Lucene is still fully available with Solr, this raises two
questions: 1) whether the additional functionality Solr provides makes
it worthwhile to integrate at that level rather than at the Lucene
library level, and 2) whether search is important enough, and different
enough from the wiki mission, to coordinate on rather than duplicate.
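
To make "distributed search over a split index" concrete, here is a rough
Python sketch of the scatter-gather idea: each shard is searched on its
own and the per-shard top hits are merged. It is only an illustration,
not Solr's or Lucene's actual code; the shard layout, the term-count
scoring and the function names are all assumptions made up for the
example.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document title)


def search_shard(shard: Dict[str, str], term: str, k: int) -> List[Hit]:
    """Toy per-shard search: score is the number of times the term occurs.

    A real engine would use a proper ranking function here; term counting
    is an assumption chosen to keep the sketch short.
    """
    term = term.lower()
    hits: List[Hit] = []
    for title, text in shard.items():
        score = float(text.lower().split().count(term))
        if score > 0:
            hits.append((score, title))
    # Each shard only needs to return its own best k hits.
    return heapq.nlargest(k, hits)


def distributed_search(shards: List[Dict[str, str]], term: str, k: int) -> List[Hit]:
    """Query every shard, then merge the partial results into a global top k."""
    per_shard = [search_shard(shard, term, k) for shard in shards]
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))


if __name__ == "__main__":
    # Hypothetical index split across three hosts.
    shards = [
        {"A": "wiki search wiki"},
        {"B": "search"},
        {"C": "wiki wiki wiki"},
    ]
    print(distributed_search(shards, "wiki", 2))
```

The merge step is the easy part. A real deployment also has to make
scores comparable across shards and cope with hosts that are slow or
down, which the sketch ignores.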
On Feb 1, 2008 4:26 PM, Robert Stojnic <rainmansr(a)gmail.com> wrote:
On Feb 1, 2008 9:16 PM, Dan Thomas <geoobject(a)gmail.com> wrote:
Brion -- have you considered using SOLR, which extends Lucene? An
enterprise-class search engine, v1.3 is nearing release and in
addition to XML and text, supports search inside rich documents
including MS Office and PDF.
http://lucene.apache.org/solr/
SOLR is a great wrapper around lucene, however I believe its focus is
different from what we need - my impression is that its main goal is to
provide an easy and powerful interface for what lucene already does, with
enhancements relevant for enterprise applications (e.g. flexible schema
structure). It addresses almost none of the issues we are having:
1) solr doesn't support distributed searching and split indexes (this is,
however, being worked on, afaik) - this is crucial since our indexes are just
too big to fit on a single host
2) there is no advanced scoring scheme (for instance, one using backlinks),
i.e. it offers the same scoring as lucene
3) the default spellchecker is the one lucene uses, i.e. with per-word
suggestions - it works fine on small data sets but gives pretty bad
suggestions on large ones
4) it uses the highlighting that splits text into equal-size chunks without
trying to look at sentence boundaries, and it doesn't support highlighting
matching phrases. I'm also not sure how efficient its text storage is, given
that we have a huge amount of text
5) no integrated prefix search for ajax suggestions
6) no parser for wiki syntax (although there is a wiki-parser being
developed for lucene)
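
For readers unfamiliar with point 5, the prefix search behind ajax
suggestions can be sketched roughly as below, using a sorted title list
and binary search. This is only an illustration under simple assumptions
(titles fit in memory, matching is case-insensitive); it is not how
lucene-search or Solr implement it, and the function names are invented
for the example.

```python
import bisect
from typing import List


def build_index(titles: List[str]) -> List[str]:
    """Lower-case and sort the titles so that equal prefixes are adjacent."""
    return sorted(title.lower() for title in titles)


def prefix_search(index: List[str], prefix: str, limit: int = 10) -> List[str]:
    """Return up to `limit` titles that start with `prefix`."""
    prefix = prefix.lower()
    if not prefix:
        return []
    # First position whose title is >= prefix; matches, if any, start here.
    start = bisect.bisect_left(index, prefix)
    matches: List[str] = []
    for title in index[start:start + limit]:
        if not title.startswith(prefix):
            break  # sorted order: once a title fails, no later one matches
        matches.append(title)
    return matches


if __name__ == "__main__":
    index = build_index(["Lucene", "Luxembourg", "Solr", "Lua", "London"])
    print(prefix_search(index, "lu"))
```

A production version would typically rank suggestions by popularity
rather than alphabetically, which is where an integrated solution helps.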
On Feb 1, 2008 2:20 PM, Brion Vibber <brion(a)wikimedia.org> wrote:
Instead, new front-end code should be in the Special:Search front-end in
core, with a back-end plugin to talk to the Lucene server (the MWSearch
extension, possibly a bit out of date.)
<nod>
r.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l