On 8/11/07, Brion Vibber brion@wikimedia.org wrote:
> Worth taking a look at, though I wonder why the people working on Mayflower aren't:
> * Active in the development list or IRC channel
> * Committing their code to SVN
So how do you suggest a search for Commons be integrated when it can't work off the current production database alone?
Mayflower uses a periodic text extract from Commons, which is passed through a stemmer; the most frequent terms are then stop-worded to prevent index bloat. Incremental updates of the full-text part of Mayflower aren't possible as currently designed.
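To make the pipeline above concrete, here is a minimal Python sketch of "stem, then stop-word the most frequent terms". It is not Mayflower's code: `crude_stem`, `build_index_terms`, and the `max_doc_fraction` threshold are hypothetical names and values chosen for illustration, and the toy suffix stripper merely stands in for whatever real stemmer is used.

```python
from collections import Counter

def crude_stem(word):
    # Toy suffix stripper standing in for a real stemmer (an assumption,
    # not the stemmer Mayflower actually uses).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index_terms(docs, max_doc_fraction=0.5):
    """docs maps a page title to its extracted text.

    Returns (terms_per_doc, stop_words). A term becomes a stop word when it
    appears in more than max_doc_fraction of all documents (hypothetical rule).
    """
    stemmed = {
        title: {crude_stem(w) for w in text.lower().split()}
        for title, text in docs.items()
    }
    # Document frequency: in how many documents does each stemmed term occur?
    doc_freq = Counter(term for terms in stemmed.values() for term in terms)
    limit = max_doc_fraction * len(docs)
    stop_words = {term for term, n in doc_freq.items() if n > limit}
    # Drop the overly frequent terms so their posting lists never get built.
    kept = {title: terms - stop_words for title, terms in stemmed.items()}
    return kept, stop_words
```

In this sketch the stop-word list is derived from frequencies over the whole extract, so it has to be recomputed from scratch whenever the extract changes.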
Since the start of Mayflower, Tangotango and I have discussed moving its backend to the search code I've been working on. That backend is far faster, doesn't fall over on overly frequent keys, and handles incremental updates just fine. I've finally gotten around to putting up a web front end for it. Check it out: the Commons version is at http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py, and the enwiki version is at http://tools.wikimedia.de/~gmaxwell/cgi-bin/enwiki_cattersect.py
The problem there is that my backend depends on PostgreSQL, because PostgreSQL provides the inverted indexing that my implementation absolutely requires. So in that case, again, we're in a situation where it's not as simple as "commit it to SVN", since using it would require a non-trivial addition of software infrastructure, which would have to be carefully considered.
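For readers unfamiliar with the term, the following is a rough in-memory Python sketch of what an inverted index with incremental updates and key intersection looks like. It is purely illustrative: the `InvertedIndex` class and its methods are hypothetical, and it says nothing about how PostgreSQL implements its indexes or how the backend described above is actually built.

```python
from collections import defaultdict

class InvertedIndex:
    """Maps each key (e.g. a category name) to the set of pages carrying it."""

    def __init__(self):
        self.postings = defaultdict(set)  # key -> set of page ids
        self.keys_of = {}                 # page id -> set of keys

    def update(self, page_id, keys):
        # Incremental update: touch only the posting lists that changed.
        new = set(keys)
        old = self.keys_of.get(page_id, set())
        for key in old - new:
            self.postings[key].discard(page_id)
            if not self.postings[key]:
                del self.postings[key]
        for key in new - old:
            self.postings[key].add(page_id)
        if new:
            self.keys_of[page_id] = new
        else:
            self.keys_of.pop(page_id, None)

    def intersect(self, *keys):
        # Pages that carry every one of the given keys.
        sets = [self.postings.get(key, set()) for key in keys]
        if not sets:
            return set()
        sets.sort(key=len)  # start from the rarest key to keep work small
        result = set(sets[0])
        for s in sets[1:]:
            result &= s
        return result
```

Usage, with made-up page ids and category names: after `idx.update(1, ["Cats", "Featured pictures"])` and `idx.update(2, ["Cats"])`, the call `idx.intersect("Cats", "Featured pictures")` returns only page 1. Re-running `update` for a single page changes just that page's entries, with no rebuild of the whole index.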