As many people know, our current search infrastructure has caused a few problems with the site. It's an area that was greatly improved by the work that Robert Stojnic (a.k.a. Rainman) did in 2008, but he hasn't had the time to keep up with it, and to date, the WMF hasn't invested a lot in further developing it.
When I started with WMF, RobLa asked me to learn our search system and start adding debugging information, with a plan to start fixing search problems. I've captured what I've learned about our search infrastructure here (it is being augmented on a daily basis currently):
http://wikitech.wikimedia.org/view/User:Ram/Search
While Rainman hasn't had a lot of time to dedicate to search, he offered to spend some time with us to talk about it. We had a small meeting today, and plan to have at least one or two more meetings (including an Open Tech Talk soon). In addition to Rainman and me, David Schoonover, Aaron Schulz, Tim Starling, and RobLa were there.
This meeting helped us understand lsearchd better and start talking about a possible plan to move to Solr. Meeting notes below:

*lsearchd deep dive*
What processes are running?
- searchfrontend & searchbackend
Indexer:
- index is a collection of files in a certain format
- one index daemon (per server?) (avoids synchronization/locking)
- RMI is used as a wrapper for searching the indexes; it manages local/foreign index shard access transparently
- Indexes are sharded on namespace and further into smaller parts (each part is checked on query, i.e. map/reduce style)
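To make the "each part is checked on query" point concrete, here is a minimal scatter-gather sketch. It is not lsearchd code: the `Shard` class is a hypothetical in-memory stand-in for one index part, and the term-count scoring is a placeholder for the real ranking. The only idea it illustrates is that every part is asked for its top hits and the partial results are merged into one ranked list.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    score: float


class Shard:
    """Hypothetical stand-in for one index part (not an lsearchd class)."""

    def __init__(self, docs):
        self.docs = docs  # maps page title -> page text

    def search(self, term, limit):
        # Placeholder scoring: number of occurrences of the term.
        needle = term.lower()
        hits = [
            Hit(title, float(text.lower().split().count(needle)))
            for title, text in self.docs.items()
        ]
        hits = [h for h in hits if h.score > 0]
        return heapq.nlargest(limit, hits, key=lambda h: h.score)


def search_all(shards, term, limit=10):
    """Ask every shard for its top hits, then merge into one ranked list."""
    partial = []
    for shard in shards:  # "map": each part is checked
        partial.extend(shard.search(term, limit))
    # "reduce": keep the overall best hits, ties broken by title
    return heapq.nlargest(limit, partial, key=lambda h: (h.score, h.title))
```

In a real deployment the loop over shards would be remote calls (RMI today), which is where the local/foreign transparency mentioned above comes in.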
Index updates:
- Initial index building for a wiki is via an XML dump using an indexbuilder tool
- Incremental updates work via polling OAI
- There used to be a synchronous update triggered by the searchupdate hook on article edit, but that is disabled.
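As a rough illustration of what "polling OAI" involves, the sketch below walks one polling round using the standard OAI-PMH `ListRecords` verb, following `resumptionToken` pages until the batch is exhausted. This is not how lsearchd implements it: the function names, the injected `fetch` callable, and the `oai_dc` metadata prefix are assumptions chosen for illustration (the repository we poll may well use a different prefix).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(base_url, since=None, token=None, prefix="oai_dc"):
    """Build a ListRecords URL; a resumption token replaces all other args."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": prefix}
        if since:
            params["from"] = since
    return base_url + "?" + urlencode(params)


def parse_response(xml_text):
    """Return ([(identifier, datestamp, deleted)], next_token_or_None)."""
    root = ET.fromstring(xml_text)
    changes = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        ident = header.findtext("oai:identifier", namespaces=OAI_NS)
        stamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        changes.append((ident, stamp, header.get("status") == "deleted"))
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return changes, (token or None)  # an empty token means "last page"


def poll_once(fetch, base_url, since):
    """One polling round: collect all changes since `since`, page by page."""
    token, latest, all_changes = None, since, []
    while True:
        xml_text = fetch(build_request(base_url, since, token))
        changes, token = parse_response(xml_text)
        all_changes.extend(changes)
        for _, stamp, _ in changes:
            # UTC ISO-8601 datestamps sort correctly as plain strings
            if stamp and (latest is None or stamp > latest):
                latest = stamp
        if not token:
            return all_changes, latest
```

The caller would feed the returned `latest` datestamp back in as `since` on the next round, so each poll only picks up pages changed in the meantime.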
Misc notes:
- /db/searchterm request format to daemon; responses in one of opensearch/xml/json format
- "prefix format" is used for "lists of suggestions"
- search daemon uses 80 threads (class SearchServer) (can run 80 search requests/sec in parallel, higher than normal load (~10?))
- one daemon running on each server
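For readers who have not looked at the daemon, the following sketch shows the general shape of a `/db/searchterm` request being parsed and handed to a fixed pool of 80 workers. It is only an approximation under stated assumptions: the helper names and the placeholder `handle` function are made up, and the real request format and SearchServer class may differ in detail.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote

MAX_WORKERS = 80  # matches the thread count noted above


def parse_request(path):
    """Split '/db/searchterm' into (db, term); the term is URL-decoded."""
    db, sep, term = path.lstrip("/").partition("/")
    if not db or not sep or not term:
        raise ValueError("expected /db/searchterm, got %r" % path)
    return db, unquote(term)


def handle(path):
    """Placeholder handler: a real daemon would run the search here."""
    db, term = parse_request(path)
    return f"{db}:{term}"


def serve(paths):
    """Process a batch of requests on a bounded pool, preserving order."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(handle, paths))
```

A bounded pool like this caps concurrency at 80 in-flight searches; anything beyond that waits in the executor's queue rather than spawning more threads.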
Possible things to fix:
- better error handling? (e.g. on timeout)
- index is opened multiple times and the handles are pooled. Searchers check locally and then check foreign servers (index is partitioned). The pool avoids synchronization around Files, which would curtail concurrency. Solr already makes optimizations for resource sharing.
- Current code is in searchpool (searchcache?) in the search package.
- RMI load balancing is not smart, just random (using Solr would probably deal with this)
- XMLRPC is not used anymore (not since the switch to OAI)
- Fix bugs in the disabled interwiki search code that caused it to hang
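On the load balancing point, the difference between "just random" and something slightly smarter can be sketched in a few lines. The `Balancer` class below is hypothetical (it is neither the current RMI code nor what Solr does internally); it contrasts a random pick with a "least outstanding requests" pick, which is one common alternative.

```python
import random


class Balancer:
    """Hypothetical host picker that tracks in-flight requests per host."""

    def __init__(self, hosts, rng=None):
        self.in_flight = {h: 0 for h in hosts}
        self.rng = rng or random.Random()

    def pick_random(self):
        # Roughly the current behaviour: ignore load entirely.
        return self.rng.choice(list(self.in_flight))

    def pick_least_loaded(self):
        # Prefer the host with the fewest outstanding requests;
        # break ties randomly so one host is not always favoured.
        low = min(self.in_flight.values())
        candidates = [h for h, n in self.in_flight.items() if n == low]
        return self.rng.choice(candidates)

    def acquire(self, host):
        self.in_flight[host] += 1

    def release(self, host):
        self.in_flight[host] -= 1
```

The caller would `acquire` a host before sending a request and `release` it when the response (or a timeout) comes back, so a slow or hung server naturally receives less new traffic.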

*Solr*
Current Lucene features to make sure the new Solr version has:
- Custom ranking metric (we have custom MW logic for determining hit score)
- "Did You Mean?" engine that can handle multi-word queries (e.g. for spellchecking)
...potentially related Solr features: http://lucene.apache.org/solr/features.html
(Query) Function Query - influence the score by user specified complex functions of numeric fields or query relevancy scores.
(Core) Pluggable user functions for Function Query
(Query) Auto-suggest functionality for completing user queries
(Query) Dynamic search results clustering using Carrot2
(Schema) Many additional text analysis components including word splitting, regex and sounds-like filters
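As a hint of how Function Query might cover part of our custom ranking, here is a sketch that builds a Solr select URL using the edismax parser's `boost` parameter, which multiplies the relevancy score by a function of a numeric field. The field name `incoming_links`, the select URL, and the choice of `log(sum(field,10))` are assumptions for illustration only; none of this reflects an agreed schema or our actual scoring logic.

```python
from urllib.parse import urlencode


def boosted_query_url(select_url, user_query, boost_field="incoming_links"):
    """Build a query URL whose score is multiplied by a function query."""
    params = {
        "q": user_query,
        "defType": "edismax",
        # log() is base 10 in Solr; adding 10 keeps the multiplier >= 1
        # for pages with zero links (hypothetical field, see lead-in).
        "boost": f"log(sum({boost_field},10))",
        "fl": "*,score",
        "wt": "json",
    }
    return select_url + "?" + urlencode(params)
```

Whether a function of indexed fields is expressive enough for the existing MW ranking logic, or whether a pluggable user function would be needed, is still an open question.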
*Solr Links*
1. http://lucene.apache.org/solr/ -- single-node frontend for index query/update
2. http://lucene.apache.org/solr/4_1_0/tutorial.html -- 4.1.0 tutorial
3. http://wiki.apache.org/solr/SolrCloud -- sharding indices and using a federated group of Solr instances to serve query responses
*OAI:* http://www.mediawiki.org/wiki/Extension:OAIRepository