Robert Stojnic wrote:
Very nice work Chris! I think it strikes a good balance of simplicity and flexibility that makes it ideal for small-to-medium sites.
The architecture itself seems to be similar to that used in early mwsearch, where the index is updated via hooks that submit articles directly to the indexer. So it assumes a single-host architecture and uses out-of-the-box Lucene scoring and highlighting, as far as I can see. I think the most interesting part for us is the handling of attachments. I see you use Apache POI and PDFBox. We should really try to use this as well; it shouldn't be too hard to do, but needs a bit of fiddling...
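(For readers who haven't used those libraries: below is a minimal, illustrative sketch of what such an extraction step might look like. It is not the extension's actual code, and package names vary across PDFBox/POI versions.)

// Illustrative sketch only -- not the extension's actual code.
// Pulls plain text out of PDF and legacy .doc attachments so it can be fed to the indexer.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class AttachmentTextExtractor {

    /** Extract plain text from a PDF attachment with PDFBox. */
    public static String extractPdf(File file) throws IOException {
        PDDocument doc = PDDocument.load(file);
        try {
            return new PDFTextStripper().getText(doc);
        } finally {
            doc.close();
        }
    }

    /** Extract plain text from a .doc attachment with Apache POI. */
    public static String extractDoc(File file) throws IOException {
        FileInputStream in = new FileInputStream(file);
        try {
            return new WordExtractor(new HWPFDocument(in)).getText();
        } finally {
            in.close();
        }
    }
}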
Chris Reigrut wrote:
> Thanks!
> You are correct in assuming that this is designed for small-to-medium sized wikis--we have about 1000 users and a couple of hundred edits per day, but our scale testing indicated it would handle at least 20-30 combined queries/updates per second (of normal-sized pages).
Yes, I am pretty sure that would be enough for almost any MediaWiki site except the few largest...
> I'm assuming that you mean "single-host" from an indexing-server point of view, and yes, at this time that is completely correct. Article indexing, however, can easily support multiple MediaWiki servers calling it. Currently the attachment indexing relies on there only being a single MediaWiki server as well, but that's an easy modification.
Agreed. However, the real-time update can only work with a single-host setup, since having multiple searchers requires some sort of index replication, which raises all kinds of issues: how frequently to replicate, whether to optimize the index first, how to deal with index warmup and hot swaps, synchronization overhead, and such...
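(As an aside, the indexer side of such a hook-driven update is quite small. The sketch below is illustrative only -- it assumes a Lucene 2.x-era API and made-up field names, not the actual extension code -- but it shows the basic add-or-replace step each incoming edit would trigger.)

// Illustrative sketch (Lucene 2.x-era API assumed): apply one article update
// as it arrives from a MediaWiki hook, replacing any older copy of the page.
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public class ArticleIndexer {
    private final IndexWriter writer;

    public ArticleIndexer(Directory dir) throws IOException {
        // A single shared writer; any number of MediaWiki servers can post updates to it.
        writer = new IndexWriter(dir, new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    }

    /** Add or replace an article, keyed by page id, so re-edits don't create duplicates. */
    public void update(String pageId, String title, String text) throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", pageId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.updateDocument(new Term("id", pageId), doc);
        writer.commit();  // make the edit visible to searchers
    }
}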
> There is a preload mechanism that grabs the pages directly from the database for indexing as well. At some point I intend to combine the two, thereby keeping the real-time update but also providing a background indexer in case the real-time feed fails for some reason (thus ensuring that no articles are missed). For us the latter's not a big problem, as we can reindex in about an hour.
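(Again purely as an illustration of the idea, not the extension's actual preload code: such a background pass could read the current text of every page straight out of the classic page/revision/text tables and hand it to the indexer. The connection details below are placeholders, and compressed/external text storage is ignored.)

// Illustrative sketch only: a background "preload" pass over the classic MediaWiki
// schema (page/revision/text), so nothing is missed if the real-time feed fails.
// The JDBC URL and credentials are placeholders; old_flags (compressed/external
// text storage) handling is omitted for brevity.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PreloadIndexer {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/wikidb", "wikiuser", "secret");
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery(
                "SELECT page_id, page_title, old_text "
              + "FROM page "
              + "JOIN revision ON rev_id = page_latest "
              + "JOIN text ON old_id = rev_text_id "
              + "WHERE page_namespace = 0");
        while (rs.next()) {
            int pageId = rs.getInt("page_id");
            String title = rs.getString("page_title");
            String text = rs.getString("old_text");
            // Hand (pageId, title, text) to the indexer here,
            // e.g. an IndexWriter.updateDocument() call.
            System.out.println("would index page " + pageId + ": " + title
                    + " (" + text.length() + " chars)");
        }
        rs.close();
        st.close();
        conn.close();
    }
}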
One other problem people had with lucene-search is that it eats a lot of resources... lucene-search can easily use a couple of gigs of RAM just for the Java process because of all the different caches and such. I was wondering if it would be possible to make a lightweight server that would work nicely e.g. on an oldish machine with 128 MB of RAM or on shared hosting?
R.
On 5/7/09 3:46 PM, Robert Stojnic wrote:
> One other problem people had with lucene-search is that it eats a lot of resources... lucene-search can easily use a couple of gigs of RAM just for the Java process because of all the different caches and such. I was wondering if it would be possible to make a lightweight server that would work nicely e.g. on an oldish machine with 128 MB of RAM or on shared hosting?
Of course if you really want to be scary there's always Zend's pure-PHP implementation of Lucene... :)
http://framework.zend.com/manual/en/zend.search.lucene.html
I suspect performance is dreadful for large or heavily loaded sites, though I'm a bit curious whether anyone's fiddled with it for smaller sites. Not being able to share code with the Java implementation means there would need to be more reimplementation, of course. :(
But not needing a Java VM means it could be easier to set up, and more likely to be usable by folks on shared hosting.
-- brion