Gregory Maxwell wrote:
> On 8/11/07, Brion Vibber <brion(a)wikimedia.org> wrote:
>> Worth taking a look at, though I wonder why the people working on
>> Mayflower aren't:
>> 1) Active in the development list or IRC channel
>> 2) Committing their code to SVN
> So how do you suggest a search for Commons be integrated when it can't
> work off the current production database alone?
That doesn't sound like any kind of impediment. The text search for
everything else doesn't work off the current production database alone,
either.
> Mayflower uses a periodic text extract from Commons which is passed
> through a stemmer; then the most frequent terms get stop-worded to
> prevent index bloat. Incremental updates of the full-text part of
> Mayflower aren't possible as currently designed.
Ok...
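The extract-stem-stopword pipeline described above can be sketched roughly as follows. This is a toy illustration, not Mayflower's actual code: the `stem` function here is a crude suffix-stripper standing in for a real (likely Porter-style) stemmer, and the 10% stop-word cutoff is an invented parameter.

```python
from collections import Counter

def stem(word):
    # Crude suffix-stripping stemmer -- a stand-in for whatever
    # stemmer Mayflower actually uses.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs, stopword_fraction=0.1):
    """Periodic (batch) build: stem every term, then drop the most
    frequent stems as stop words to keep the index from bloating."""
    stemmed = {doc_id: [stem(w.lower()) for w in text.split()]
               for doc_id, text in docs.items()}
    # Global term frequencies across the whole extract.
    freqs = Counter(t for terms in stemmed.values() for t in terms)
    n_stop = max(1, int(len(freqs) * stopword_fraction))
    stopwords = {t for t, _ in freqs.most_common(n_stop)}
    index = {}
    for doc_id, terms in stemmed.items():
        for t in terms:
            if t not in stopwords:
                index.setdefault(t, set()).add(doc_id)
    return index, stopwords
```

Note why a design like this is batch-only: the stop-word set is derived from global frequencies over the whole extract, so adding or changing one document can shift which terms are stop-worded, forcing a full rebuild.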
> Since the start of Mayflower, Tangotango and I have discussed moving
> its backend to the search stuff I've been working on. The backend
> stuff I'm using is far faster, doesn't fall over with overly frequent
> keys, and handles incremental updates just fine. I've finally gotten
> around to putting up a web front end for my search stuff; check it
> out. The Commons version is at
> http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py and the
> enwiki version is at
> http://tools.wikimedia.de/~gmaxwell/cgi-bin/enwiki_cattersect.py
Awesome!
> The problem there is that my backend stuff depends on PostgreSQL,
> because PostgreSQL provides the inverted indexing which is utterly
> required for the qualities of my implementation. So in that case,
> again we're in a situation where it's not as simple as "commit it to
> SVN", since using it would require a non-trivial addition of software
> infrastructure which would have to be carefully considered.
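For readers unfamiliar with the term: an inverted index maps each term to the set of documents containing it, which is exactly what makes per-document incremental updates cheap — touching one page only touches that page's terms. A toy in-memory sketch (not Gregory's PostgreSQL implementation, where the database's full-text indexing plays this role):

```python
class InvertedIndex:
    """Toy inverted index: term -> set of document ids containing it."""

    def __init__(self):
        self.postings = {}   # term -> set of doc ids
        self.docs = {}       # doc id -> its terms, kept so updates can undo

    def add(self, doc_id, text):
        # Incremental update: only this document's own terms are
        # touched, so re-indexing one changed page is cheap.
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.docs[doc_id] = terms
        for t in terms:
            self.postings.setdefault(t, set()).add(doc_id)

    def remove(self, doc_id):
        for t in self.docs.pop(doc_id, set()):
            self.postings[t].discard(doc_id)

    def search(self, query):
        # AND query: intersect the posting sets of every query term.
        sets = [self.postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()
```

In PostgreSQL the same structure comes from its full-text search support (at the time, the tsearch2 contrib module) rather than an in-memory dict, which is the "non-trivial addition of software infrastructure" being discussed.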
I think there's some misconceptions here. :)
First, we've got *lots* of irons in the fire in Subversion.
Not just the things we're running at the moment, but all sorts of
extensions and both experimental and used-elsewhere tangents like
Semantic MediaWiki and WikiData/OmegaWiki, experimental geographic
mapping extensions, the Python Wikipedia bot framework, and of course
MediaWiki support for PostgreSQL backends.
Putting the code into Subversion means that the entire community of
people working on the world of Wikimedia-related code can see it, review
it, and pitch in if interested.
That has obvious advantages for anything that does reach readiness to go
live on our own servers, of course, since the core devs can keep an eye
on it and aid in maintenance.
As with other tools such as the Lucene search, using additional backend
support tools is *not* a deal-breaker as long as they're free/open
source and can be worked into the system. Being 'partitioned' from the
rest of the system, as this sort of search backend is, makes it
especially easy -- it doesn't reach into every part of the system.
I would very strongly encourage this work to be done openly and using
the shared infrastructure of the Wikimedia and MediaWiki community.
We're pretty much handing out SVN accounts to anyone interested in
working on MediaWiki-related stuff, so don't feel that a wiki-integrated
search engine is going to be ignored or has to be hidden away until it's
perfected. :)
-- brion vibber (brion @ wikimedia.org)