As many people know, our current search infrastructure has caused a
few problems with the site. It's an area that was greatly improved by
the work that Robert Stojnic (a.k.a. Rainman) did in 2008, but he
hasn't had the time to keep up with it, and to date, the WMF hasn't
invested a lot in further developing it.
When I started with WMF, RobLa asked me to learn our search system and
start adding debugging information, with a plan to start fixing search
problems. I've written up what I've learned about our search
infrastructure here (I'm currently adding to it daily):
http://wikitech.wikimedia.org/view/User:Ram/Search
While Rainman hasn't had a lot of time to dedicate to search, he offered
to spend some time with us to talk about it. We had a small
meeting today, and plan to have at least one or two more meetings
(including an Open Tech Talk soon). In addition to Rainman and me,
David Schoonover, Aaron Schulz, Tim Starling, and RobLa were there.
This meeting was helpful for us in understanding lsearchd better, and
starting to talk about a possible plan to move to Solr. Meeting notes
below:
*lsearchd deep dive*
What processes are running?
- searchfrontend & searchbackend
Indexer:
- An index is a collection of files in a certain format
- one index daemon (per server?) (avoids synchronization/locking)
- RMI is used as a wrapper for searching the indexes; it manages
local/foreign index shard access transparently
- Indexes are sharded by namespace and further into smaller parts (each
part is checked on query, e.g. map/reduce style)
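
To make the "each part is checked on query" idea concrete, here is a minimal sketch of a scatter/gather search over shards. It is not lsearchd code (lsearchd is Java and uses RMI); the shard names, the in-memory documents, and the term-count scoring are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards: shard name -> {title: text}.
SHARDS = {
    "main.part1": {"Apple": "apple fruit tree", "Banana": "banana fruit"},
    "main.part2": {"Cherry": "cherry fruit tree fruit"},
    "talk": {"Talk:Apple": "discussion about the apple article"},
}


def search_shard(shard_name, term):
    """Return (score, title) hits from one shard; score is a naive term count."""
    hits = []
    for title, text in SHARDS[shard_name].items():
        score = text.split().count(term)
        if score:
            hits.append((score, title))
    return hits


def search_all(term, limit=10):
    """Query every shard in parallel, then merge the partial results by score."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda name: search_shard(name, term), SHARDS)
        merged = [hit for part in partials for hit in part]
    # Highest score first; ties are broken by title so the order is stable.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))[:limit]
```

Calling `search_all("fruit")` queries all three shards and returns one merged, ranked list, which is roughly the job the RMI wrapper does across local and foreign shards.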
Index updates:
- Initial index building for a wiki is via an XML dump using an
indexbuilder tool
- Incremental updates work via polling OAI
- There used to be a synchronous update triggered by the searchupdate
hook on article edit but that is disabled.
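
As a rough illustration of the incremental path, the sketch below polls an OAI-PMH endpoint for records changed since a timestamp and advances a watermark. The `verb`, `metadataPrefix`, `from` and `resumptionToken` parameters come from the OAI-PMH protocol; the base URL, the `mediawiki` prefix value, the record shape, and the `fetch`/`apply_update` callables are assumptions for the example, not what lsearchd actually does.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the real repository URL depends on the wiki.
BASE_URL = "http://example.org/wiki/Special:OAIRepository"


def build_oai_url(base_url, since=None, resumption_token=None):
    """Build an OAI-PMH ListRecords URL for fetching changed records."""
    if resumption_token:
        # In OAI-PMH a resumptionToken is sent alone with the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # "mediawiki" as the metadata prefix is an assumption here.
        params = {"verb": "ListRecords", "metadataPrefix": "mediawiki"}
        if since:
            params["from"] = since
    return base_url + "?" + urlencode(params)


def poll_once(fetch, apply_update, since, base_url=BASE_URL):
    """Fetch records changed after `since`, apply them, return the new watermark.

    `fetch(url)` is a hypothetical callable returning a list of dicts with a
    "timestamp" key (ISO 8601 UTC strings, which sort lexicographically).
    """
    for record in fetch(build_oai_url(base_url, since=since)):
        apply_update(record)
        since = max(since, record["timestamp"])
    return since
```

A real poller would call `poll_once` in a loop with a sleep between rounds and would persist the returned watermark so a restart does not re-read old changes.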
Misc notes:
- Requests to the daemon use the /db/searchterm format; responses come
back in one of the opensearch/xml/json formats
- "prefix format" is used for "lists of suggestions"
- The search daemon uses 80 threads (class SearchServer) (can run 80
search requests in parallel, which is higher than normal load (~10?))
- one daemon running on each server
Possible things to fix:
- better error handling? (e.g. on timeout)
- The index is opened multiple times and the handles are pooled. Searchers
check locally and then check foreign servers (the index is partitioned). The
pool avoids synchronization around Files, which would curtail concurrency.
Solr already makes optimizations for resource sharing.
- Current code is in searchpool (searchcache?) in the search package.
- RMI load balancing is not smart, just random (using Solr would probably
deal with this)
- XMLRPC is not used anymore (not since the switch to OAI)
- Fix bugs in the disabled interwiki search code that caused it to hang
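
To show what "smarter than random" could mean for the load-balancing item, here is a small sketch that compares random selection with picking the host that has the fewest in-flight requests. The class, method names, and host names are hypothetical; this is neither the lsearchd RMI code nor how Solr balances requests.

```python
import random


class ReplicaPicker:
    """Tracks in-flight requests per host and picks a host for the next one."""

    def __init__(self, hosts):
        self.outstanding = {host: 0 for host in hosts}

    def pick_random(self):
        """Random choice, which ignores current load on each host."""
        return random.choice(list(self.outstanding))

    def pick_least_loaded(self):
        """Pick the host with the fewest in-flight requests (name breaks ties)."""
        return min(self.outstanding, key=lambda h: (self.outstanding[h], h))

    def start(self, host):
        """Record that a request was sent to `host`."""
        self.outstanding[host] += 1

    def finish(self, host):
        """Record that a request to `host` completed or timed out."""
        self.outstanding[host] -= 1
```

With random selection a slow host keeps getting its share of traffic; with the least-loaded rule, requests that pile up on a slow host steer new traffic elsewhere.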
*Solr*
Current Lucene features to make sure new Solr version has:
- Custom ranking metric (we have custom MW logic for determining hit
score)
- "Did You Mean?" engine that can handle multi-word queries (e.g. for
spellchecking)
...potentially related Solr features:
http://lucene.apache.org/solr/features.html
- (Query) Function Query - influence the score by user specified complex
functions of numeric fields or query relevancy scores.
- (Core) Pluggable user functions for Function Query
- (Query) Auto-suggest functionality for completing user queries
- (Query) Dynamic search results clustering using Carrot2
- (Schema) Many additional text analysis components including word
splitting, regex and sounds-like filters
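
As one possible way the Function Query feature could stand in for part of our custom ranking, the sketch below builds a Solr select URL that multiplies the relevancy score by a function of a numeric field, using the edismax parser's `boost` parameter. The core name `collection1`, the host and port, and the `incoming_links` field are assumptions for illustration; our real ranking logic is more involved than this.

```python
from urllib.parse import urlencode


def build_boosted_query(core_url, user_query, boost_function):
    """Build a Solr select URL whose score is multiplied by a function query."""
    params = {
        "q": user_query,
        "defType": "edismax",     # extended DisMax query parser
        "boost": boost_function,  # multiplicative boost, a Solr function query
        "wt": "json",             # ask for a JSON response
    }
    return core_url + "/select?" + urlencode(params)


# Hypothetical core URL and field name.
url = build_boosted_query(
    "http://localhost:8983/solr/collection1",
    "search engine",
    "log(sum(incoming_links,1))",
)
```

The `sum(...,1)` keeps the argument to `log` positive for pages with no incoming links; whether a boost like this is close enough to the current MW scoring is something we would need to test.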
*Solr Links*
1. http://lucene.apache.org/solr/ -- single-node frontend for index
query/update
2. http://lucene.apache.org/solr/4_1_0/tutorial.html -- 4.1.0 tutorial
3. http://wiki.apache.org/solr/SolrCloud -- Sharding indices and using a
federated group of Solr instances to serve query responses
*OAI:*
http://www.mediawiki.org/wiki/Extension:OAIRepository