Hi Robert,
2007/5/23, Robert Stojnic rainmansr@gmail.com:
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result.
Why is [[Noam Chomsky]] only at #4 in the results?
Hmm, good, this made me rethink the scoring, so I made some adjustments that favor larger articles a bit more and favor more exact title matches. So now Noam Chomsky is in the right first place. :)
I suggest you test this with MRR (Mean Reciprocal Rank):
1. Use all titles as both queries and answers.
2. Score each query result by its reciprocal rank (RR): if the answer appears at rank N in the results, score = 1/N.
3. Average the RRs over all queries to get the MRR.
For queries that contain common phrases, you may want to devise a more elaborate RR based on similarity to the title, or just annotate an answer set by hand.
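A rough sketch of that evaluation in Python, in case it helps. The `search` callable and the canned results in `toy_search` are hypothetical stand-ins for the real backend, and scoring a missing answer as 0 and cutting off at `top_n` are my assumptions, not something your engine prescribes:

```python
def reciprocal_rank(results, answer):
    """Return 1/N if `answer` sits at 1-based rank N in `results`, else 0.0."""
    for rank, title in enumerate(results, start=1):
        if title == answer:
            return 1.0 / rank
    return 0.0  # assumption: an answer that never shows up scores zero


def mean_reciprocal_rank(titles, search, top_n=20):
    """Use every title as both query and expected answer, then average the RRs."""
    titles = list(titles)
    if not titles:
        return 0.0
    total = sum(reciprocal_rank(search(t)[:top_n], t) for t in titles)
    return total / len(titles)


def toy_search(query):
    """Hypothetical search backend returning canned ranked titles."""
    canned = {
        "Noam Chomsky": ["Chomsky hierarchy", "Linguistics",
                         "Generative grammar", "Noam Chomsky"],
        "Lucene": ["Lucene"],
    }
    return canned.get(query, [])


mrr = mean_reciprocal_rank(["Noam Chomsky", "Lucene", "Missing page"], toy_search)
print(round(mrr, 4))
```

Here the three queries score 1/4, 1 and 0, so the number printed is their average. Comparing that figure before and after a scoring change should show whether the change helped overall.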
Otherwise, the search engine is fast and the results are overall
promising. Are you considering adding snippets of the search results?
Highlighting is a very CPU- and memory-consuming thingy. You need to fetch all articles in the search results (i.e. 20 per page), retokenize them, fragment them into snippets, and score each snippet so you can show the best one. I'm currently working on a distributed implementation for this, but it might still put too heavy a load on the cluster.
Apache Solr may be an alternative solution for this.
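For reference, the per-article work described above (retokenize, fragment, score, pick the best) might look roughly like this in Python. This is only a naive illustration under my own assumptions: fixed-size word windows, a plain count of query-term hits as the score, and `*word*` as the highlight marker. It is not how Lucene or Solr implement highlighting:

```python
PUNCTUATION = ".,;:!?\"'()[]"


def normalize(word):
    """Lowercase a word and strip surrounding punctuation for matching."""
    return word.lower().strip(PUNCTUATION)


def best_snippet(text, query, size=8):
    """Split `text` into windows of `size` words, score each window by the
    number of query-term hits, and return the best one with hits marked."""
    terms = {normalize(w) for w in query.split()}
    words = text.split()
    if not words:
        return ""
    best_score, best_fragment = -1, []
    for start in range(0, len(words), size):
        fragment = words[start:start + size]
        score = sum(1 for w in fragment if normalize(w) in terms)
        if score > best_score:  # strict '>' keeps the earliest window on ties
            best_score, best_fragment = score, fragment
    return " ".join(
        f"*{w}*" if normalize(w) in terms else w for w in best_fragment
    )


article = ("Lucene is a search library written in Java. "
           "Noam Chomsky is a linguist and a well known author.")
snippet = best_snippet(article, "noam chomsky", size=5)
print(snippet)
```

Even this toy version walks every word of every article, and it would have to run once per result, so 20 times per page.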
BTW, I'm quite interested in Lucene-related tasks. If it's OK with you, I would like to help. :)
Sincerely, /Mike "b6s" Jiang/