Kate's Lucene-based search server is now up and running experimentally to cover searches on en.wikipedia.org. It's compiled with GCJ, so it's not polluted by any of that dirty icky not-quite-free Sun Java VM stuff. ;)
For those of you new to the game, Lucene is a text search engine library written in Java, hosted by the Apache Software Foundation: http://lucene.apache.org/ Using a separate search server like this instead of MySQL's fulltext index lets us take some load off the main databases.
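To make the division of labor concrete, here's a minimal sketch of the kind of indexing and searching the daemon does, written against the Lucene 1.4-era API. The index path and field names are made up for illustration; this is not MWDaemon's actual code.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchSketch {
        public static void main(String[] args) throws Exception {
            // Build a tiny index on disk ("/tmp/index" is an example path).
            IndexWriter writer =
                new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("title", "Esperanto"));
            doc.add(Field.Text("contents", "Esperanto is a constructed language."));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // Query it; MySQL is never involved.
            IndexSearcher searcher = new IndexSearcher("/tmp/index");
            Query query = QueryParser.parse("constructed language", "contents",
                                            new StandardAnalyzer());
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("title"));
            }
            searcher.close();
        }
    }

The whole thing runs in its own process against its own index files, which is why the main database servers never see a fulltext query.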
To compare our options I did an experimental port to C# using dotlucene; some benchmarking showed that while the C# version running on Mono outpaced the Java version on GCJ for building the index, Java+GCJ did better on actual searches (even surpassing Sun's Java in some tests). Since searches are more time-critical (as long as updates can keep up with the rate of edits), we'll probably stick with Java.
More info at:
* http://www.livejournal.com/community/wikitech/9608.html
* http://meta.wikimedia.org/wiki/User:Brion_VIBBER/MWDaemon
At the moment the drop-down suggest-while-you-type box is disabled as GCJ and BerkeleyDB Java Edition really don't get along. I'll either hack it to use the native library version of BDB or just rewrite the title prefix matcher to use a different backend.
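If it comes to that, the prefix matcher doesn't strictly need a database backend at all; a sorted in-memory title list with a binary search would do. The sketch below is purely illustrative, not necessarily what MWDaemon will end up with.

    import java.util.ArrayList;
    import java.util.Arrays;

    public class TitlePrefixMatcher {
        private final String[] titles; // sorted array of page titles

        public TitlePrefixMatcher(String[] sortedTitles) {
            this.titles = sortedTitles;
        }

        // Return up to max titles beginning with the given prefix.
        public String[] match(String prefix, int max) {
            int i = Arrays.binarySearch(titles, prefix);
            if (i < 0) {
                i = -i - 1; // binarySearch returns -(insertion point) - 1 on a miss
            }
            ArrayList out = new ArrayList();
            while (i < titles.length && out.size() < max
                    && titles[i].startsWith(prefix)) {
                out.add(titles[i++]);
            }
            return (String[]) out.toArray(new String[out.size()]);
        }
    }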
-- brion vibber (brion @ pobox.com)
This is really neat.
I see from the wiki page that you have looked at PyLucene, but are you considering reapplying their strategy of wrapping the GCJ-compiled Lucene using SWIG? It looks like SWIG can now generate PHP bindings as well, leaving you with native PHP access to the Lucene APIs. PHPLucene?
This might be precisely how you were planning on accessing the GCJ-compiled Lucene, but I wasn't sure. After seeing a presentation on PyLucene at PyCon, I think the PyLucene people would be interested in seeing (maybe even helping) their project expand beyond just Python bindings.
best, Jonah
----- Original Message -----
From: "Brion Vibber" <brion@pobox.com>
To: "Wikimedia developers" <wikitech-l@wikimedia.org>
Sent: Sunday, April 10, 2005 8:17 AM
Subject: [Wikitech-l] Lucene search
Jonah Bossewitch wrote in gmane.science.linguistics.wikipedia.technical:
I see from the wiki page that you have looked at PyLucene, but are you considering reapplying their strategy of wrapping the GCJ-compiled Lucene using SWIG? It looks like SWIG can now generate PHP bindings as well, leaving you with native PHP access to the Lucene APIs. PHPLucene?
This might be precisely how you were planning on accessing the GCJ-compiled Lucene,
The design is client/server, so the specific implementation language of the server isn't much of an issue - someone may want to rewrite it in Python (or even PHP), but personally I don't see much gain from this. The only reason GCJ is used over another JVM is the license.
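(To illustrate the client/server point: any language that can open a socket can be a client. The host, port, and wire format below are invented for the example; the daemon's real protocol may look nothing like this.)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    public class SearchClient {
        public static void main(String[] args) throws Exception {
            // Hypothetical host, port, and line-based protocol.
            Socket sock = new Socket("search.example.org", 8123);
            PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                new InputStreamReader(sock.getInputStream(), "UTF-8"));
            out.println("SEARCH enwiki constructed language");
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // say, one result title per line
            }
            sock.close();
        }
    }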
but I wasn't sure. After seeing a presentation on PyLucene at PyCon, I think the PyLucene people would be interested in seeing (maybe even helping) their project expand beyond just Python bindings.
best, Jonah
kate.
Please also consider updating and using our new page: http://meta.wikipedia.org/wiki/FulltextSearchEngines
TIA, Tom
On Sun, 2005-04-10 at 05:17 -0700, Brion Vibber wrote:
Kate's Lucene-based search server is now up and running experimentally
Cool...so where's the code?
I'd like to try this out under stress with my test suite (which I just resurrected from the CVS attic).
For anyone else who might be interested: the test suite actually tests the wiki over the web with multi-threaded fetches, searches, and the like, and shares no code at all with MediaWiki, making it a proper QA test. I've put it on my Subversion server (I don't much like CVS): http://piclab.com/svn/wikitest
It compiles and runs, and loads the wiki with lots of sample pages, but most of the actual tests fail now because the wiki doesn't much resemble what it did when I wrote the tests, so I'll be fixing that in the next few days.
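(The over-the-web approach is simple to picture: spawn a bunch of threads, each fetching pages over plain HTTP and checking what comes back. Lee's suite is its own code; this Java fragment, with a made-up localhost URL, just sketches the idea.)

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchTest extends Thread {
        private final String url;

        public FetchTest(String url) {
            this.url = url;
        }

        public void run() {
            try {
                HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
                int status = conn.getResponseCode();
                InputStream body = conn.getInputStream();
                while (body.read() != -1) {
                    // A real test would scrape the body and assert on content;
                    // here we just drain it.
                }
                body.close();
                System.out.println(status + " " + url);
            } catch (Exception e) {
                System.out.println("FAIL " + url + ": " + e);
            }
        }

        public static void main(String[] args) throws Exception {
            String base = "http://localhost/wiki/index.php?search="; // made up
            String[] terms = { "esperanto", "lucene", "wiki" };
            Thread[] threads = new Thread[terms.length];
            for (int i = 0; i < terms.length; i++) {
                threads[i] = new FetchTest(base + terms[i]);
                threads[i].start();
            }
            for (int i = 0; i < threads.length; i++) {
                threads[i].join();
            }
        }
    }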
Lee Daniel Crocker wrote:
On Sun, 2005-04-10 at 05:17 -0700, Brion Vibber wrote:
Kate's Lucene-based search server is now up and running experimentally
Cool...so where's the code?
lucene-search module in CVS.
I'm still fiddling with my C# version also, which has not yet been checked into CVS.
I'd like to try this out under stress with my test suite (which I just resurrected from the CVS attic).
For anyone else who might be interested: the test suite actually tests the wiki over the web with multi-threaded fetches, searches, and the like, and shares no code at all with MediaWiki, making it a proper QA test. I've put it on my Subversion server (I don't much like CVS): http://piclab.com/svn/wikitest
Personally, I'd prefer to see more work on the internal parser tests and unit tests.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
I'd like to try this out under stress with my test suite (which I just resurrected from the CVS attic).
For anyone else who might be interested: the test suite actually tests the wiki over the web with multi-threaded fetches, searches, and the like, and shares no code at all with MediaWiki, making it a proper QA test. I've put it on my Subversion server (I don't much like CVS): http://piclab.com/svn/wikitest
Personally, I'd prefer to see more work on the internal parser tests and unit tests.
I should note that I don't necessarily think an HTTP-based test suite is useless; such tests would be necessary to validate HTTP caching behavior, link resolution, etc.
But the variability of the user interface makes it really hard to do anything useful with an HTML screen-scraper, and that includes automated testing. Lower-level tests that hook into the code should a) be able to effectively test things that are hard or impossible to test from outside, and b) promote the ongoing modularisation of the code.
The parser test suite is run from maintenance/parserTests.php (test cases in maintenance/parserTests.txt). There are some PhpUnit-based unit tests in tests/ also, but these aren't very complete and haven't been kept up to date with changes in HEAD.
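(For anyone who hasn't looked, the cases in parserTests.txt use a simple block format; a minimal case looks roughly like this, though the exact expected HTML depends on the parser version:)

    !! test
    Simple paragraph
    !! input
    Hello world
    !! result
    <p>Hello world
    </p>
    !! end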
There are also some standalone tests for the Unicode validation and normalization library in includes/normal/.
-- brion vibber (brion @ pobox.com)
On Sun, 2005-04-10 at 23:48 -0700, Brion Vibber wrote:
Personally, I'd prefer to see more work on the internal parser tests and unit tests.
Those are probably more important for keeping the existing software running smoothly, and I certainly join you in encouraging their development. But the external tests are more useful for testing the impact of big changes, like a new search engine, various server task splitting methods, database schema changes, and so on. And of course, parser tests on either side aren't very meaningful until there's a formally defined syntax, but I'm working on that too.
There's been some sort of memory leak in the MWDaemon running on Vincent. After a few hours it starts swapping, and the daemon has to be restarted to free up the memory.
In some quick testing on my own box (running Ubuntu with a Linux 2.6 kernel) I haven't been able to reproduce leaking; memory usage seems to stay fairly constant at least over a few minutes of grueling testing.
Vincent is running Debian-testing, and for some reason was installed with a 2.4 kernel. Differences in threading behavior of the older kernel might or might not be related to the leak, so I figured it'd be worth giving it an upgrade. I installed the 2.6 kernel package, poked briefly at the grub menu to make sure it looked right, and restarted. The machine either hasn't come back up or hasn't reconnected to the network, and it's either not on the serial console server or isn't responding (and its port isn't labeled), so it'll need to be checked out at some point.
In the meantime, I'm setting up a second Lucene search server on Isidore, the other experimental machine which is running FreeBSD 5.3. The GCC build is slightly broken on FreeBSD so this required a little patching and pain, but we'll see how it goes.
While the index is building, the search is offline.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
The machine either hasn't come back up or hasn't reconnected to the network, and it's either not on the serial console server or isn't responding (and its port isn't labeled), so it'll need to be checked out at some point.
Chad's fixed it; some funky difference in how the new kernel handles the drives (SATA oddities) kept them from being mounted at boot.
The Lucene server is now running on Vincent and Avicenna to split the load.
In the meantime, I'm setting up a second Lucene search server on Isidore, the other experimental machine which is running FreeBSD 5.3. The GCC build is slightly broken on FreeBSD so this required a little patching and pain, but we'll see how it goes.
I'm not sure what the cause was (GCC build, threading, disk, etc.), but Isidore seemed a bit sluggish. The index build was running about a third as fast as it did on Vincent, so I uploaded a test copy of the index to run from until Vincent came back online.
The daemon's rate of service also seemed a bit slower than what Vincent peaked at.
-- brion vibber (brion @ pobox.com)
The Lucene search is now active on the following wikis:
* http://en.wikipedia.org/
* http://eo.wikipedia.org/
* http://ru.wikipedia.org/
* http://meta.wikimedia.org/
And when the index is done building, also:
* http://de.wikipedia.org/
The choice of initial languages is driven by availability; English, German, and Russian word-stem normalizers are included in the main Lucene package, and I wrote a quickie Esperanto one as a test since I'm familiar with the language. There are a number of other analyzers available in contrib packages which I'll be setting up over the next couple of days, as well as indexing the rest of the wikis in the supported languages.
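Hooking a language up mostly means picking an analyzer for it at index and search time. A sketch against the Lucene 1.4 class layout might look like this; the Esperanto case is left as a comment since that analyzer is homegrown, not part of the Lucene distribution:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.ru.RussianAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerPicker {
        // Pick a per-language stemming analyzer. For English one would wrap
        // the core PorterStemFilter in a custom Analyzer; StandardAnalyzer
        // (no stemming) is the fallback for not-yet-supported languages.
        public static Analyzer forLanguage(String lang) {
            if (lang.equals("de")) {
                return new GermanAnalyzer();
            }
            if (lang.equals("ru")) {
                return new RussianAnalyzer();
            }
            // (a homegrown Esperanto analyzer would slot in here for "eo")
            return new StandardAnalyzer();
        }
    }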
At the moment the indexes aren't being automatically updated with changed pages, but that should be a relatively straightforward addition which is also coming soon.
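Updating in Lucene boils down to "delete the old document by key, then add the new one"; the sketch below does that with the 1.4-era API, assuming a "pageid" key field (an assumption for illustration, not necessarily the real index schema):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateSketch {
        // Replace one page's entry in the index when it's edited.
        public static void updatePage(String indexPath, String pageId,
                                      String title, String text) throws Exception {
            IndexReader reader = IndexReader.open(indexPath);
            reader.delete(new Term("pageid", pageId)); // drop the stale copy
            reader.close();

            IndexWriter writer =
                new IndexWriter(indexPath, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("pageid", pageId));
            doc.add(Field.Keyword("title", title));
            doc.add(Field.Text("contents", text));
            writer.addDocument(doc);
            writer.close();
        }
    }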
-- brion vibber (brion @ pobox.com)
On 4/14/05, Brion Vibber <brion@pobox.com> wrote:
The choice of initial languages is driven by availability; English, German, and Russian word-stem normalizers are included in the main Lucene package, and I wrote a quickie Esperanto one as a test since I'm familiar with the language. There are a number of other analyzers available in contrib packages which I'll be setting up over the next couple of days, as well as indexing the rest of the wikis in the supported languages.
How hard would it be to come up with these word-stem normalizers for other languages (i.e. did you base the Esperanto one on another similar language, or did you come up with it yourself relatively easily)? Is there a good description somewhere of how to come up with them?
Dori
Dori wrote:
How hard would it be to come up with these word-stem normalizers for other languages (i.e. did you base the Esperanto one on another similar language, or did you come up with it yourself relatively easily)? Is there a good description somewhere of how to come up with them?
I took a quick look at the PorterStemFilter class (for English) that comes in the Lucene distribution to see which classes I had to inherit from and what interface to implement, then just whipped up some regular expressions as a quick hack. It seems relatively straightforward, as long as the language isn't too tricky. :)
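To make that concrete, here's a guess at what such a quickie regex-based filter might look like against the Lucene 1.4 TokenFilter API. This is a reconstruction for illustration, not the actual code in the lucene-search module:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class EsperantoStemFilter extends TokenFilter {
        public EsperantoStemFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) {
                return null;
            }
            String s = t.termText();
            // Esperanto morphology is regular: strip the accusative -n, the
            // plural -j, then the part-of-speech ending (-o noun, -a adjective,
            // -e adverb, -as/-is/-os/-us/-u/-i verb forms).
            if (s.length() > 3) {
                s = s.replaceAll("n$", "")
                     .replaceAll("j$", "")
                     .replaceAll("(as|is|os|us|u|i|o|a|e)$", "");
            }
            return new Token(s, t.startOffset(), t.endOffset(), t.type());
        }
    }

Esperanto's regularity (every noun ends in -o, every adjective in -a, and so on) is what makes a regex hack viable; English needs something like the full Porter algorithm.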
There are other existing filters out there, as mentioned.
-- brion vibber (brion @ pobox.com)