On 8/11/07, Brion Vibber <brion(a)wikimedia.org> wrote:
> Worth taking a look at, though I wonder why the people working on
> Mayflower aren't:
> 1) Active in the development list or IRC channel
> 2) Committing their code to SVN
So how do you suggest a search for Commons be integrated when it can't
work off the current production database alone?
Mayflower uses a periodic text extract from Commons, which is passed
through a stemmer; the most frequent terms then get stop-worded to
prevent index bloat. Incremental updates of the full-text part of
Mayflower aren't possible as currently designed.
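
To make the batch pipeline above concrete, here is a minimal Python sketch of
the same idea: stem each term, count how many documents it appears in, then
stop-word the most frequent terms before building the index. This is not
Mayflower's actual code. The names crude_stem, tokenize, build_index and
max_stop_terms are hypothetical, and the suffix stripper is only a toy
stand-in for a real stemmer.

```python
import re
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper (an assumption; a real stemmer would go here)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split on non-letters, and stem each term."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_index(docs, max_stop_terms=1):
    """Build a term -> set-of-doc-ids index from a full extract.

    The max_stop_terms most frequent terms (by document frequency) are
    dropped so their huge posting lists do not bloat the index.
    """
    doc_terms = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(t for terms in doc_terms.values() for t in terms)
    stop_terms = {t for t, _ in doc_freq.most_common(max_stop_terms)}

    index = {}
    for doc_id, terms in doc_terms.items():
        for term in terms - stop_terms:
            index.setdefault(term, set()).add(doc_id)
    return index, stop_terms
```

Note that the stop-word set here depends on frequencies over the whole
extract, so this sketch has to be rerun from scratch whenever the text
changes.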
Since the start of Mayflower, Tangotango and I have discussed moving
its backend to the search stuff I've been working on. The backend
stuff I'm using is far faster, doesn't fall over with overly frequent
keys, and handles incremental updates just fine. I've finally gotten
around to putting up a web front end for my search stuff. Check it
out: the Commons version is at
http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py and the enwiki
version is at
http://tools.wikimedia.de/~gmaxwell/cgi-bin/enwiki_cattersect.py
The problem there is that my backend stuff depends on PostgreSQL,
because PostgreSQL provides the inverted indexing that my
implementation absolutely requires. So in that case, again, we're in a
situation where it's not as simple as "commit it to SVN", since using
it would require a non-trivial addition of software infrastructure,
which would have to be carefully considered.
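
For readers unfamiliar with inverted indexing, here is a small pure-Python
sketch of the data structure: each key (a category, say) maps to the set of
items carrying it, a single item can be updated without rebuilding anything,
and a query intersects the posting sets. This is only an in-memory
illustration, not the PostgreSQL-backed implementation described above. The
class InvertedIndex and its methods are hypothetical names.

```python
class InvertedIndex:
    """In-memory inverted index: key -> set of item ids."""

    def __init__(self):
        self._postings = {}  # key -> set of item ids carrying that key
        self._keys_of = {}   # item id -> set of keys currently indexed

    def update(self, item_id, keys):
        """Insert or replace one item's keys without a full rebuild."""
        new_keys = set(keys)
        old_keys = self._keys_of.get(item_id, set())

        # Remove the item from posting sets it no longer belongs to.
        for key in old_keys - new_keys:
            self._postings[key].discard(item_id)
            if not self._postings[key]:
                del self._postings[key]

        # Add the item to posting sets for its newly acquired keys.
        for key in new_keys - old_keys:
            self._postings.setdefault(key, set()).add(item_id)

        if new_keys:
            self._keys_of[item_id] = new_keys
        else:
            self._keys_of.pop(item_id, None)

    def intersect(self, *keys):
        """Return the ids of items that carry every one of the given keys."""
        if not keys:
            return set()
        # Start from the smallest posting set to keep the intersection cheap.
        postings = sorted((self._postings.get(k, set()) for k in keys), key=len)
        result = set(postings[0])
        for posting in postings[1:]:
            result &= posting
            if not result:
                break
        return result
```

Usage would look like idx.update("File:A.jpg", ["Bridges", "Night"]) followed
by idx.intersect("Bridges", "Night"). A database-backed inverted index plays
the same role, with the posting sets stored and maintained by the database
rather than in process memory.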