Hi Robert,
Yes, there are multiple queries. In my scenario, "precision first" usually implies the number of returned results is limited: users have little patience either for waiting on responses or for reading pages of results. That's why I prefer a sequential process over a parallel one; I can fetch a small and hopefully precise result set first, then query for more if that set turns out to be too small, i.e. the recall is not high enough.
For example, a Chinese query goes through the word-based analyzer first, with a limit, say 1000:
static int m_limit = 1000;
Query query = _a_word_based_Chinese_query_here_;
ArrayList<MyResult> resultList = new ArrayList<MyResult>();
// Run the word-based query first, capped at m_limit hits.
TopDocs topDocs = m_standardSearcher.search(query, (Filter) null, m_limit);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = m_standardSearcher.doc(scoreDoc.doc);
    float score = scoreDoc.score;
    resultList.add(new MyResult(doc, score));
}
If the size of resultList does not reach 1000, another character-based query is fired to fetch more results, up to (1000 - current size).
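The fallback step can be sketched in isolation like this. The two IntFunction parameters are hypothetical stand-ins for the actual Lucene word-based and character-based searches (each takes a result limit and returns hits), so the control flow is testable without an index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class SequentialFallback {
    static final int LIMIT = 1000;

    // Run the word-based query first; only if it returns fewer than
    // LIMIT hits, top the list up with character-based results.
    static List<String> search(IntFunction<List<String>> wordQuery,
                               IntFunction<List<String>> charQuery) {
        List<String> results = new ArrayList<>(wordQuery.apply(LIMIT));
        if (results.size() < LIMIT) {
            results.addAll(charQuery.apply(LIMIT - results.size()));
        }
        return results;
    }
}
```

Note that when the precise query already fills the limit, the second (broader, slower) query never runs at all, which is where the speed win comes from.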
It's a very simple heuristic, and it proved fast enough on a single P4 2 GHz machine with 2 GB RAM serving a 3 GB Lucene index: results returned within 1 second on average.
The problem with all multiple, parallel, or distributed Lucene queries is that score merging may not be reasonable, especially when the indexes use different tokenization strategies.
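To make the score-merging problem concrete: raw Lucene scores from a word-based and a character-based index live on different scales, so interleaving them directly is meaningless. One common workaround (not something from my system above, just an illustration) is to min-max rescale each list to [0, 1] before merging:

```java
import java.util.Arrays;

public class ScoreMerge {
    // Rescale one result list's scores to [0, 1] so lists from
    // differently-tokenized indexes can be interleaved; a list whose
    // scores are all equal maps to all 1s.
    static float[] normalize(float[] scores) {
        float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        float[] out = new float[scores.length];
        if (max == min) {
            Arrays.fill(out, 1f);
            return out;
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = (scores[i] - min) / (max - min);
        }
        return out;
    }
}
```

Even after normalization the rankings are not strictly comparable, since the underlying analyzers score different things; normalization only keeps one index's scale from dominating the merged list.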
You may also be interested in
http://issues.apache.org/jira/browse/NUTCH-92 , http://hellonline.com/blog/?p=55 , and http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12709.html
Thank you!
Cheers, /Mike/
Robert Stojnic wrote:
Hm, wouldn't that require running multiple queries for a single user query? If I'm understanding it correctly, it refines the search by trying different queries and merging the results? For the Wikipedia system, speed is of utmost importance, since it's a high-traffic site and has very few resources (compared to other sites with similar traffic).
r.
On 5/23/07, Tian-Jian Barabbas Jiang@Gmail <barabbas@gmail.com> wrote:
Although I bet you have already done it, here's my 2 cents: I usually apply one concept in my IR systems: precision first, recall next. For example, my system may do an exact match first, get the results from searcher.doc(topDocs.scoreDocs[i].doc), and save them externally. That allows me to merge some more partially matched results later. Obviously this could be done with something like parallel queries, but I prefer to merge them sequentially myself.
wikitech-l@lists.wikimedia.org