Gregory Maxwell wrote:
On 8/11/07, Brion Vibber brion@wikimedia.org wrote:
Worth taking a look at, though I wonder why the people working on Mayflower aren't:
* Active in the development list or IRC channel
* Committing their code to SVN
So how do you suggest a search for Commons be integrated when it can't work off the current production database alone?
That doesn't sound like any kind of impediment. The text search for everything else doesn't work off the current production database alone, either.
Mayflower uses a periodic text extract from Commons, which is passed through a stemmer; the most frequent terms are then stop-worded to prevent index bloat. Incremental updates of the full-text part of Mayflower aren't possible as currently designed.
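To make the pipeline described above concrete, here is a minimal sketch in Python. It is not Mayflower's actual code: the toy suffix-stripping stemmer, the names `crude_stem` and `build_index`, and the `max_stop_words` parameter are all assumptions made for illustration. In a design like this one the stop-word set depends on corpus-wide counts, which is one plausible reason an index of this kind gets rebuilt from a full extract rather than updated in place; the thread does not say whether that is the actual reason.

```python
from collections import Counter


def crude_stem(word):
    """Hypothetical toy stemmer: strips a few common English suffixes.

    A real system would use a proper stemming algorithm instead.
    """
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, max_stop_words=1):
    """Build a term -> set(doc_id) index from a full text extract.

    The `max_stop_words` most frequent stems (by document frequency)
    are treated as stop words and left out of the index.
    """
    # Stem every word of every document.
    stemmed = {
        doc_id: {crude_stem(w) for w in text.lower().split()}
        for doc_id, text in docs.items()
    }
    # Count in how many documents each stem occurs.
    doc_freq = Counter(term for terms in stemmed.values() for term in terms)
    stop_words = {term for term, _ in doc_freq.most_common(max_stop_words)}
    # Index everything that is not a stop word.
    index = {}
    for doc_id, terms in stemmed.items():
        for term in terms - stop_words:
            index.setdefault(term, set()).add(doc_id)
    return index, stop_words


docs = {
    "A.jpg": "ships sailing photo",
    "B.jpg": "ship photo",
    "C.jpg": "sailing boats photo",
}
index, stop_words = build_index(docs)
```

Here "photo" occurs in every description, so it ends up as the single stop word and never gets a posting list of its own.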
Ok...
Since the start of Mayflower, Tangotango and I have discussed moving its backend to the search stuff I've been working on. The backend I'm using is far faster, doesn't fall over with overly frequent keys, and handles incremental updates just fine. I've finally gotten around to putting up a web front end for my search stuff. Check it out: the Commons version is at http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py and the enwiki version is at http://tools.wikimedia.de/~gmaxwell/cgi-bin/enwiki_cattersect.py
Awesome!
The problem there is that my backend stuff depends on PostgreSQL, because PostgreSQL provides the inverted indexing which is utterly required for the qualities of my implementation. So in that case, again, we're in a situation where it's not as simple as "commit it to SVN", since using it would require a non-trivial addition of software infrastructure, which would have to be carefully considered.
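For readers unfamiliar with the idea, an inverted index that supports incremental updates and category intersection can be sketched in a few lines of Python. This is an in-memory stand-in for illustration only, not the PostgreSQL-backed implementation discussed here; the class `CategoryIndex` and its methods are hypothetical names.

```python
class CategoryIndex:
    """Toy inverted index: category -> set of pages in that category."""

    def __init__(self):
        self._postings = {}   # category -> set of page titles
        self._page_cats = {}  # page title -> set of categories

    def update_page(self, page, categories):
        """Incrementally apply one page's new category list."""
        new = set(categories)
        old = self._page_cats.get(page, set())
        # Remove the page from categories it no longer belongs to.
        for cat in old - new:
            self._postings[cat].discard(page)
            if not self._postings[cat]:
                del self._postings[cat]
        # Add the page to categories it has newly joined.
        for cat in new - old:
            self._postings.setdefault(cat, set()).add(page)
        if new:
            self._page_cats[page] = new
        else:
            self._page_cats.pop(page, None)

    def intersect(self, *categories):
        """Return the pages that are in all of the given categories."""
        if not categories:
            return set()
        lists = [self._postings.get(c, set()) for c in categories]
        lists.sort(key=len)  # start from the smallest posting list
        result = set(lists[0])
        for postings in lists[1:]:
            result &= postings
        return result


idx = CategoryIndex()
idx.update_page("A.jpg", ["Ships", "1900s"])
idx.update_page("B.jpg", ["Ships"])
first = idx.intersect("Ships", "1900s")
idx.update_page("B.jpg", ["Ships", "1900s"])  # a single edit, no rebuild
second = idx.intersect("Ships", "1900s")
```

Only the posting lists touched by an edit change, which is what makes per-edit updates cheap compared with rebuilding from a full extract.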
I think there are some misconceptions here. :)
First, we've got *lots* of irons in the fire in Subversion.
Not just the things we're running at the moment, but all sorts of extensions and both experimental and used-elsewhere tangents like Semantic MediaWiki and WikiData/OmegaWiki, experimental geographic mapping extensions, the Python Wikipedia bot framework, and of course MediaWiki support for PostgreSQL backends.
Putting the code into Subversion means that the entire community of people working on the world of Wikimedia-related code can see it, review it, and pitch in if interested.
That has obvious advantages for anything that does reach readiness to go live on our own servers, of course, since the core devs can keep an eye on it and aid in maintenance.
As with other tools such as the Lucene search, using additional backend support tools is *not* a deal-breaker as long as they're free/open source and can be worked into the system. Being 'partitioned' from the rest of the system like with this sort of search backend makes it especially easy -- it doesn't reach into every part of the system.
I would very strongly encourage this work to be done openly and using the shared infrastructure of the Wikimedia and MediaWiki community. We're pretty much handing out SVN accounts to anyone interested in working on MediaWiki-related stuff, so don't feel that a wiki-integrated search engine is going to be ignored or has to be hidden away until it's perfected. :)
-- brion vibber (brion @ wikimedia.org)