Gregory Maxwell wrote:
> On 8/11/07, Brion Vibber <brion(a)wikimedia.org> wrote:
>> Worth taking a look at, though I wonder why the people working on
>> Mayflower aren't:
>> 1) Active in the development list or IRC channel
>> 2) Committing their code to SVN
> So how do you suggest a search for Commons be integrated when it can't
> work off the current production database alone?
That doesn't sound like any kind of impediment. The text search for
everything else doesn't work off the current production database alone,
either.
> Mayflower uses a periodic text extract from Commons which is passed
> through a stemmer; then the most frequent terms get stop-worded to
> prevent index bloat. Incremental updates of the full-text part of
> Mayflower aren't possible as currently designed.
Ok...
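The extract-stem-stopword pipeline described above can be sketched roughly as follows. This is a toy illustration, not Mayflower's actual code: the `stem` function here is a crude suffix-stripper standing in for a real (likely Porter-style) stemmer, and the 10% stop-word cutoff is an invented parameter.

```python
from collections import Counter

def stem(word):
    # Crude suffix-stripping stemmer -- a stand-in for whatever
    # stemmer Mayflower actually uses.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs, stopword_fraction=0.1):
    """Periodic (batch) build: stem every term, then drop the most
    frequent stems as stop words to keep the index from bloating."""
    stemmed = {doc_id: [stem(w.lower()) for w in text.split()]
               for doc_id, text in docs.items()}
    # Global term frequencies across the whole extract.
    freqs = Counter(t for terms in stemmed.values() for t in terms)
    n_stop = max(1, int(len(freqs) * stopword_fraction))
    stopwords = {t for t, _ in freqs.most_common(n_stop)}
    index = {}
    for doc_id, terms in stemmed.items():
        for t in terms:
            if t not in stopwords:
                index.setdefault(t, set()).add(doc_id)
    return index, stopwords
```

Note why a design like this is batch-only: the stop-word set is derived from global frequencies over the whole extract, so adding or changing one document can shift which terms are stop-worded, forcing a full rebuild.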
> Since the start of Mayflower, Tangotango and I have discussed moving
> its backend to the search stuff I've been working on. The backend
> stuff I'm using is far faster, doesn't fall over with overly frequent
> keys, and handles incremental updates just fine. I've finally gotten
> around to putting up a web front end for my search stuff; check it
> out. The Commons version is at
> http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py and the
> enwiki version is at
> http://tools.wikimedia.de/~gmaxwell/cgi-bin/enwiki_cattersect.py
Awesome!
> The problem there is that my backend stuff depends on PostgreSQL,
> because PostgreSQL provides the inverted indexing which is utterly
> required for the qualities of my implementation. So in that case,
> again we're in a situation where it's not as simple as "commit it to
> SVN", since using it would require a non-trivial addition of software
> infrastructure which would have to be carefully considered.
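For readers unfamiliar with the term: an inverted index maps each term to the set of documents containing it, which is exactly what makes per-document incremental updates cheap — touching one page only touches that page's terms. A toy in-memory sketch (not Gregory's PostgreSQL implementation, where the database's full-text indexing plays this role):

```python
class InvertedIndex:
    """Toy inverted index: term -> set of document ids containing it."""

    def __init__(self):
        self.postings = {}   # term -> set of doc ids
        self.docs = {}       # doc id -> its terms, kept so updates can undo

    def add(self, doc_id, text):
        # Incremental update: only this document's own terms are
        # touched, so re-indexing one changed page is cheap.
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.docs[doc_id] = terms
        for t in terms:
            self.postings.setdefault(t, set()).add(doc_id)

    def remove(self, doc_id):
        for t in self.docs.pop(doc_id, set()):
            self.postings[t].discard(doc_id)

    def search(self, query):
        # AND query: intersect the posting sets of every query term.
        sets = [self.postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()
```

In PostgreSQL the same structure comes from its full-text search support (at the time, the tsearch2 contrib module) rather than an in-memory dict, which is the "non-trivial addition of software infrastructure" being discussed.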
I think there's some misconceptions here. :)
First, we've got *lots* of irons in the fire in Subversion.
Not just the things we're running at the moment, but all sorts of
extensions and both experimental and used-elsewhere tangents like
Semantic MediaWiki and WikiData/OmegaWiki, experimental geographic
mapping extensions, the Python Wikipedia bot framework, and of course
MediaWiki support for PostgreSQL backends.
Putting the code into Subversion means that the entire community of
people working on the world of Wikimedia-related code can see it, review
it, and pitch in if interested.
That has obvious advantages for anything that does reach readiness to go
live on our own servers, of course, since the core devs can keep an eye
on it and aid in maintenance.
As with other tools such as the Lucene search, using additional backend
support tools is *not* a deal-breaker as long as they're free/open
source and can be worked into the system. Being 'partitioned' from the
rest of the system, as this sort of search backend is, makes it
especially easy -- it doesn't reach into every part of the system.
I would very strongly encourage this work to be done openly and using
the shared infrastructure of the Wikimedia and MediaWiki community.
We're pretty much handing out SVN accounts to anyone interested in
working on MediaWiki-related stuff, so don't feel that a wiki-integrated
search engine is going to be ignored or has to be hidden away until it's
perfected. :)
-- brion vibber (brion @ wikimedia.org)