Hi all,
Tian-Jian "Barabbas" Jiang said:
I suggest you test this by MRR (Mean Reciprocal Rate):
s/Rate/Rank/g. Sorry about the typo. You may also want to check MAP (Mean Average Precision).
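In case it helps, here is a minimal sketch of both measures in plain Java. The class name RankMetrics, the method names, and the use of string document keys are my own choices for illustration and not part of any Lucene API. It assumes each ranked list contains no duplicate keys and that you already have a hand-judged set of relevant documents per query.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene class.
final class RankMetrics {

    /** Reciprocal rank of the first relevant document, or 0 if none is found. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1); // ranks are 1-based
            }
        }
        return 0.0;
    }

    /** Average precision: precision@k summed over the ranks k that hold a relevant
     *  document, divided by the total number of relevant documents. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    /** MRR: mean of the reciprocal ranks over all queries. */
    static double meanReciprocalRank(List<List<String>> runs, List<Set<String>> judgments) {
        if (runs.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (int q = 0; q < runs.size(); q++) {
            total += reciprocalRank(runs.get(q), judgments.get(q));
        }
        return total / runs.size();
    }

    /** MAP: mean of the average precisions over all queries. */
    static double meanAveragePrecision(List<List<String>> runs, List<Set<String>> judgments) {
        if (runs.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (int q = 0; q < runs.size(); q++) {
            total += averagePrecision(runs.get(q), judgments.get(q));
        }
        return total / runs.size();
    }
}
```

MRR only rewards the position of the first correct hit, so it suits queries with a single right answer (such as looking up an article by title). MAP takes every relevant document into account.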
Although I bet you have already done it, here's my 2 cents: I usually adopt one principle in my IR system: precision first, recall next. For example, my system may run an exact-match query first, get the results from
searcher.doc(topDocs.scoreDocs[i].doc)
and save them externally. That allows me to merge in more partially matched results later. Of course, this could be done with something like parallel queries, but I prefer to merge them sequentially myself.
For queries that contain common phrases, you may want to devise a more elaborate RR based on similarity to the title, or just annotate an answer set by hand.
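The sequential merge can be sketched roughly as follows. This is plain Java with no Lucene dependency. The class SequentialMerge and its merge method are hypothetical names, and the integers stand for whatever document key you saved externally (for example, the values read from topDocs.scoreDocs[i].doc within a single searcher).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene class.
final class SequentialMerge {

    /**
     * Keeps exact matches first (precision), then appends partial matches
     * that are not already present (recall), up to 'limit' results.
     */
    static List<Integer> merge(List<Integer> exactDocIds, List<Integer> partialDocIds, int limit) {
        // LinkedHashSet preserves insertion order and silently drops duplicates.
        Set<Integer> merged = new LinkedHashSet<>(exactDocIds);
        merged.addAll(partialDocIds);

        List<Integer> out = new ArrayList<>(merged);
        if (out.size() > limit) {
            return new ArrayList<>(out.subList(0, limit));
        }
        return out;
    }
}
```

For example, merging exact hits [3, 1] with partial hits [1, 7, 3, 9] and a limit of 3 gives [3, 1, 7]: the exact hits keep their positions and only the new partial hits are appended.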
> Otherwise, the search engine is fast and the results are overall
> promising. Are you considering adding snippets of the search results?
>
> Highlighting is a very CPU- and memory-consuming thingy. You need to
> fetch all articles in the search results (i.e. 20 per page), retokenize
> them, fragment them into snippets, and score each snippet so you can
> show the best. I'm currently working on a distributed implementation
> for this, but it might still put too heavy a load on the cluster.
Apache Solr may be an alternative solution for this.
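To make the cost described in the quoted message concrete, here is a deliberately naive sketch of the retokenize / fragment / score loop for one document. It is not how Lucene's or Solr's highlighters are implemented. The class NaiveSnippets, the whitespace tokenizer, and the fixed-size windows are all simplifying assumptions of mine, and the query terms are assumed to be lowercased already.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene or Solr class.
final class NaiveSnippets {

    /**
     * Returns the fixed-size window of tokens that contains the most
     * query-term occurrences. Ties go to the earliest window.
     */
    static String bestSnippet(String text, Set<String> queryTerms, int windowSize) {
        String trimmed = text.trim();
        if (trimmed.isEmpty() || windowSize <= 0) {
            return "";
        }
        // "Retokenize": a crude whitespace tokenizer stands in for a real analyzer.
        String[] tokens = trimmed.split("\\s+");

        int bestStart = 0;
        int bestScore = -1;
        // "Fragment": walk the text in non-overlapping windows of windowSize tokens.
        for (int start = 0; start < tokens.length; start += windowSize) {
            int end = Math.min(start + windowSize, tokens.length);
            // "Score": count tokens that match a query term after normalization.
            int score = 0;
            for (int i = start; i < end; i++) {
                String normalized = tokens[i]
                        .toLowerCase(Locale.ROOT)
                        .replaceAll("[^\\p{L}\\p{N}]", ""); // strip punctuation
                if (queryTerms.contains(normalized)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                bestStart = start;
            }
        }
        int bestEnd = Math.min(bestStart + windowSize, tokens.length);
        return String.join(" ", Arrays.copyOfRange(tokens, bestStart, bestEnd));
    }
}
```

Even this toy version touches every token of every article on the result page, which is why the work grows with both page size and article length.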
BTW, I'm quite interested in Lucene-related tasks. If it's OK with you, I would like to help. :)
Sincerely,
/Mike "b6s" Jiang/