Thanks!
You are correct in assuming that this is designed for small-to-medium-sized wikis. We have about 1,000 users and a couple of hundred edits per day, but our scale testing indicated it would handle at least 20-30 combined queries/updates per second (on normal-sized pages).
I'm assuming you mean "single-host" from an indexing-server point of view, and yes, at this time that is completely correct. Article indexing, however, can easily support multiple MediaWiki servers calling it. Currently the attachment indexing also relies on there being a single MediaWiki server, but that's an easy modification.
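For anyone curious, the indexer side is conceptually just a shared Lucene IndexWriter that every wiki server submits article updates to. A minimal sketch, assuming a Lucene 2.x-style API; the class and field names here are illustrative, not the actual extension code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Hypothetical indexer class; any number of wiki servers can call index().
    public class ArticleIndexer {
        private final IndexWriter writer;

        public ArticleIndexer(IndexWriter writer) {
            this.writer = writer;
        }

        // Called for every article save. updateDocument() replaces any
        // existing entry for the title, so repeated saves stay consistent.
        public synchronized void index(String title, String text) throws Exception {
            Document doc = new Document();
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
            writer.updateDocument(new Term("title", title), doc);
        }
    }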
There is also a preload mechanism that grabs pages directly from the database for indexing. At some point I intend to combine the two, keeping the real-time update while also providing a background indexer in case the real-time feed fails for some reason (thereby ensuring that no articles are missed). For us a failed feed isn't a big problem, as we can reindex everything in about an hour.
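To give an idea of what that combined background pass might look like, here is a sketch that walks the database directly and pushes everything through the same indexer. The join across the page/revision/text tables is simplified from the real MediaWiki schema, and ArticleIndexer is the hypothetical class from the sketch above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PreloadIndexer {
        // Reindex every page's latest revision straight from the database,
        // picking up anything the real-time feed may have missed.
        public static void reindexAll(ArticleIndexer indexer, String jdbcUrl) throws Exception {
            Connection conn = DriverManager.getConnection(jdbcUrl);
            try {
                Statement st = conn.createStatement();
                ResultSet rs = st.executeQuery(
                    "SELECT page_title, old_text FROM page " +
                    "JOIN revision ON rev_id = page_latest " +
                    "JOIN text ON old_id = rev_text_id");
                while (rs.next()) {
                    indexer.index(rs.getString("page_title"), rs.getString("old_text"));
                }
            } finally {
                conn.close();
            }
        }
    }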
Robert Stojnic wrote:
Very nice work, Chris! I think it strikes a good balance of simplicity and flexibility that makes it ideal for small-to-medium sites.
The architecture itself seems to be similar to that used in early mwsearch, where the index is updated via hooks that submit articles directly to the indexer. So it assumes a single-host architecture and uses out-of-the-box Lucene scoring and highlighting, as far as I can see. I think the most interesting part for us is the handling of attachments. I see you use Apache POI and PDFBox. We should really try to use this as well; it shouldn't be too hard to do, but needs a bit of fiddling...
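For what it's worth, the glue code around POI and PDFBox really is small. A rough sketch of the extraction step (package names differ between PDFBox releases; this uses the org.apache.pdfbox layout, and only PDF and .doc handling are shown):

    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    public class AttachmentText {
        // Extract plain text from an uploaded attachment so it can be
        // fed to the same indexer as article text.
        public static String extract(File f) throws Exception {
            String name = f.getName().toLowerCase();
            if (name.endsWith(".pdf")) {
                PDDocument doc = PDDocument.load(f);
                try {
                    return new PDFTextStripper().getText(doc);
                } finally {
                    doc.close();
                }
            } else if (name.endsWith(".doc")) {
                return new WordExtractor(new FileInputStream(f)).getText();
            }
            return ""; // other formats omitted in this sketch
        }
    }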