On 30/04/11 21:38, MZMcBride wrote:
> Where's the best documentation for the search setup? And are there any pages
If by setup you mean the setup WMF is using, then see [1]. If you mean how we use Lucene (with some historical context), then [2] and [3] are a good starting point. Beyond that, it comes down to reading the comments in the code.
> with a roadmap for future development?
The roadmap is pretty much fixing the bugs reported in Bugzilla for the lucene-search extension. There are quite a few of them, but most are of a technical nature.
Any further improvements in the *quality* of search results would require employing someone who specialises in natural language processing/data mining/search to improve on the existing algorithms. The algorithms we currently use are pretty much the state of the art in the open-source world, and I would consider any further improvement to be proper scientific research.
> I'm particularly curious if the Java component can't be killed.
I doubt it. It isn't the case that we simply use Lucene out of the box and could switch to another port. In fact, the backend search extension (lucene-search) is pretty big, with some 50k lines of code. It implements a couple of algorithms I put together to work with the way information is structured on Wikipedia, in the languages I speak.
r.
[1] http://wikitech.wikimedia.org/view/Search
[2] http://www.mediawiki.org/wiki/User:Rainman
[3] http://www.mediawiki.org/wiki/User:Rainman/search_internals