Last week I added some statistics reporting to our search daemon. It keeps track of a 1-minute rolling average of the rate of handled requests, discarded requests, the time it takes to service a request, and the number of simultaneously active threads.
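For anyone curious how a rolling average like this can be kept, here is a rough Python sketch. It is not the daemon's actual code: the `RollingStats` class, its method names, and the injectable `clock` parameter are all made up for illustration, and it assumes every request's timestamp and service time fit comfortably in memory for one minute.

```python
import time
from collections import deque


class RollingStats:
    """Hypothetical helper: request rate and mean service time over a sliding window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock          # injectable so the window can be tested without sleeping
        self.samples = deque()      # (timestamp, service_time_ms), oldest first

    def _expire(self, now):
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window:
            self.samples.popleft()

    def record(self, service_time_ms):
        now = self.clock()
        self.samples.append((now, service_time_ms))
        self._expire(now)

    def rate_per_second(self):
        self._expire(self.clock())
        return len(self.samples) / self.window

    def average_service_time_ms(self):
        self._expire(self.clock())
        if not self.samples:
            return 0.0
        return sum(ms for _, ms in self.samples) / len(self.samples)
```

The same idea covers discarded requests (a second instance that only counts) and active threads (record the current thread count instead of a service time).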
This info is reported to Ganglia, and can be watched at, e.g.: http://ganglia.wikimedia.org/large/?m=search_rate&r=hour&s=descendin... http://ganglia.wikimedia.org/large/?m=search_time&r=hour&s=descendin...
With a better idea of the actual performance of the system, we've been able to put some work into optimizing it.
First, several more old Apache boxes have been commandeered, increasing the search cluster from 3 to 8 machines. Second, Tim has switched the load balancing from simple round-robin plus failover to a more flexible and cleaner system using perlbal.
We found that the boxes with 3+ gigabytes of RAM performed significantly better than the boxes with only 1 gig, probably because the 1-gig boxes cannot dedicate as much memory to caching the on-disk index files.
As of today, the cluster has been split into two groups, each separately managed by perlbal. The four 3+-gigabyte machines handle en.wikipedia.org and de.wikipedia.org, our two biggest and most active wikis, and the 1-gigabyte machines handle everything else. We'll know better during peak hours tomorrow, but so far it looks pretty good; reported dropped connections have nearly vanished, and average service times are below 50ms for all boxen.
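To make the split concrete, here is a rough Python sketch of the routing idea: pick a pool by wiki hostname, then rotate through that pool's backends and skip any that are down. This is only an illustration and says nothing about how perlbal is actually configured or how it balances internally; the `Pool` class, `pool_for` function, and backend names in the usage note are hypothetical.

```python
import itertools

# The two wikis served by the larger-memory group, per the split described above.
LARGE_WIKIS = {"en.wikipedia.org", "de.wikipedia.org"}


class Pool:
    """Hypothetical backend pool: round-robin with simple failover."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._cycle = itertools.cycle(self.backends)

    def pick(self, is_up=lambda backend: True):
        # Try each backend at most once per call, skipping ones reported down.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if is_up(backend):
                return backend
        raise RuntimeError("no backend available")


def pool_for(host, big_pool, small_pool):
    """Send the two biggest wikis to the big-memory pool, everything else to the other."""
    return big_pool if host in LARGE_WIKIS else small_pool
```

Usage would look something like `pool_for("en.wikipedia.org", Pool(["big1", "big2"]), Pool(["small1", "small2"])).pick()`, with the made-up names replaced by real hosts.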
Future work:
River has done some work on fancying up the search for Wikia, but we haven't yet gotten a clear agreement on whether or not the company is willing to open-source it. If they do do this soon, we may adopt Wikia's code.
If not, we'll continue working on the base we've got to spiffy it up. The first order of business is doing another round of comparisons on the base VM: currently we're running on Mono, which was chosen originally for the combination of being 1) open source, 2) reasonably performant, 3) not leaking memory. GCJ was a touch faster, but leaked memory. Sun's JVM didn't leak memory, but isn't quite open-source. Somewhere along the line, though, the Mono version sprang a memory leak and we have to restart the daemon regularly to keep it from dying. I'm uncertain whether this is in our code, in the Lucene port, or in Mono itself.
I'll want to test with a more current version of the C# Lucene port, and update the Java code to test against current versions of GCJ/Classpath and Sun's JVM, now that Lucene 2.0 is available.
Another important improvement we could make is better indexing updates: we should at least be able to add new pages to the index in close to real time, even if full rebuilds are still more intermittent.
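As a toy illustration of what incremental updates mean here, the Python sketch below adds or replaces a single page in an inverted index without rebuilding the rest. It is not Lucene's API and not our indexer; `TinyIndex` and its methods are hypothetical, and it tokenizes by whitespace only.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical in-memory inverted index supporting per-page updates."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of page titles
        self.terms_by_page = {}            # title -> terms currently indexed for it

    def remove(self, title):
        # Remove a page's old terms; empty posting lists are dropped.
        for term in self.terms_by_page.pop(title, ()):
            self.postings[term].discard(title)
            if not self.postings[term]:
                del self.postings[term]

    def add_or_update(self, title, text):
        # An update is a delete of the old version followed by an add.
        self.remove(title)
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term].add(title)
        self.terms_by_page[title] = terms

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))
```

New and edited pages would go through `add_or_update` as they are saved, while a full rebuild could still run on whatever slower schedule is practical.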
And if it's ready for another release, I may check out the Sphinx search engine as well. It claims better speed and result ordering than Lucene, but when I was first testing it out it was too much in flux with a lot of new stuff going into the development version.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org