Neil Harris wrote:
Magnus Manske wrote:
I just found
http://point.davidglasser.net/wp-content/uploads/point.davidglasser.net/2008...
and thought I'd share it with the list...
Magnus
Interesting. What they seem to be proposing is to store the tags for each article in a plain text field, and then use the built-in MySQL full-text search mechanism to index and search that, thus taking advantage of all the development already devoted to speeding up general-purpose full text search.
I wonder how it would scale to Wikipedia's vast datasets?
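For readers who have not used the approach, here is a minimal sketch of what "tags in a plain text field plus MySQL full-text search" could look like. The table and column names (page_tags, page_id, tags) and the helper build_tag_query are assumptions made up for illustration; they are not taken from the linked document or from MediaWiki. The sketch only builds the query string and its parameter, using MySQL's boolean-mode operators (+ for required, - for excluded), and does not connect to a database.

```python
import re

# Hypothetical schema (names are assumptions, not MediaWiki's):
#   CREATE TABLE page_tags (
#       page_id INT PRIMARY KEY,
#       tags    TEXT,            -- e.g. 'physicists nobel_laureates'
#       FULLTEXT KEY (tags)
#   ) ENGINE=MyISAM;


def build_tag_query(required, excluded=()):
    """Return (sql, params) for pages having every tag in `required`
    and none of the tags in `excluded`."""
    for tag in list(required) + list(excluded):
        # Keep tags to plain word characters so they cannot be read
        # as boolean-mode operators.
        if not re.fullmatch(r"\w+", tag):
            raise ValueError("unsupported tag: %r" % (tag,))
    terms = ["+" + t for t in required] + ["-" + t for t in excluded]
    sql = (
        "SELECT page_id FROM page_tags "
        "WHERE MATCH(tags) AGAINST (%s IN BOOLEAN MODE)"
    )
    return sql, (" ".join(terms),)


sql, params = build_tag_query(["physicists", "nobel_laureates"], ["living_people"])
print(sql)
print(params[0])
```

Each tag has to survive as a single token in the full-text index, so very short tags may be ignored depending on the server's minimum word length setting.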
Argh... I tested exactly that question last year: https://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-December/028081.ht...
and then talked about it again here https://lists.wikimedia.org/mailman/htdig/wikitech-l/2008-February/036570.ht...
I think that the fulltext solution is probably very good for a mid-size application, but it sounds (from people who know more about MySQL databases than I do) like it would not stand up to Wikipedia's traffic. Tim Starling suggested that Lucene is better at intersections than MySQL's fulltext index... I also did some testing with a Lucene index, but I really don't want to set up Java on my server, so I used Zend_Search_Lucene. That gave performance similar to the MySQL fulltext index *BUT* when I queried the same index with Luke (which is Java), the query was *fast*. Sorry, I can't find the mailing list posts about that.
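To make "intersections" concrete: a tag query such as "pages tagged A and B" amounts to intersecting the list of page ids for each tag. The sketch below shows the general inverted-index idea with a two-pointer merge over sorted lists. It is an illustration only, under the assumption that each tag maps to an ascending list of page ids; it does not describe the internals of Lucene or MySQL, and the function names are made up.

```python
def intersect(a, b):
    """Intersect two ascending lists of page ids with a two-pointer merge."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def pages_with_all_tags(index, tags):
    """Return page ids carrying every tag in `tags`.

    `index` maps tag -> ascending list of page ids (an assumption of
    this sketch). Shortest lists are intersected first so the running
    result shrinks as early as possible.
    """
    postings = sorted((index.get(t, []) for t in tags), key=len)
    if not postings:
        return []
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
        if not result:
            break  # nothing left to match
    return result


index = {
    "physicists": [1, 3, 5, 7],
    "nobel_laureates": [3, 4, 5],
    "living_people": [5, 9],
}
print(pages_with_all_tags(index, ["physicists", "nobel_laureates"]))  # [3, 5]
```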
So, I think the solution is to either a) add a field to the current search index or b) create a new search index. A fulltext index might make a nice addition to MediaWiki for smaller installations though (and for folks who don't want to run Java).
FYI, I am using a fulltext index for tagging on my social bookmarking application http://tagthis.info (I know it's not a great social bookmarking app; the idea is that it's a hosted service where anyone can add tags to webpages with some JavaScript - I'm beta testing it on my wiki directory) and at that scale the performance is quite adequate. I'm looking at CLucene to set up an indexing daemon for my higher-performance searching needs (which might interest folks on this list?)
Best Regards, Aerik
On Thu, Feb 28, 2008 at 3:19 PM, Aerik Sylvan aerik@thesylvans.com wrote:
So, I think the solution is to either a) add a field to the current search index or b) create a new search index. A fulltext index might make a nice addition to MediaWiki for smaller installations though (and for folks who don't want to run Java).
Yes, I very much agree. In fact, a good first step here would be to just add the functionality to the software using fulltext, enabled by default, and have Wikimedia disable it (and perhaps enable it experimentally on some of the small wikis). And just add appropriate extensibility like Special:Search has, so that LuceneSearch can be adapted to allow using Lucene for this. I might even be interested in doing this over the summer, if I don't have loads of other stuff to do. But my commitments tend not to be very reliable (how many bugs are still assigned to me from ages ago?), so I'm not making a commitment here.
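The "appropriate extensibility" idea can be sketched as a small backend interface: the core code asks a configured backend for matching pages, and a site can swap the default implementation for a faster one. Everything named below (TagSearchBackend, InMemoryBackend, register_backend, get_backend) is hypothetical and written in Python purely for illustration; it is not MediaWiki's actual hook or search API.

```python
from abc import ABC, abstractmethod


class TagSearchBackend(ABC):
    """Interface every tag-search backend would implement (hypothetical)."""

    @abstractmethod
    def search(self, tags):
        """Return a sorted list of page ids carrying all of `tags`."""


class InMemoryBackend(TagSearchBackend):
    """Stand-in for a default backend; a real one would query a database."""

    def __init__(self, page_tags):
        self.page_tags = page_tags  # {page_id: set of tags}

    def search(self, tags):
        wanted = set(tags)
        return sorted(
            pid for pid, have in self.page_tags.items() if wanted <= have
        )


# Registry so a site configuration can pick a backend by name.
BACKENDS = {}


def register_backend(name, factory):
    BACKENDS[name] = factory


def get_backend(name, *args, **kwargs):
    return BACKENDS[name](*args, **kwargs)


register_backend("memory", InMemoryBackend)

backend = get_backend("memory", {1: {"a", "b"}, 2: {"a"}, 3: {"a", "b", "c"}})
print(backend.search(["a", "b"]))  # [1, 3]
```

A Lucene-backed implementation would register under a different name and implement the same search method, so calling code would not need to change.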
Hello,
That gave performance similar to the MySQL fulltext index *BUT* when I queried the same index with Luke (which is Java), the query was *fast*. Sorry, I can't find the mailing list posts about that.
Zend Lucene is 100x slower than Java Lucene.
On 29/02/2008, Domas Mituzas midom.lists@gmail.com wrote:
That gave performance similar to the MySQL fulltext index *BUT* when I queried the same index with Luke (which is Java), the query was *fast*. Sorry, I can't find the mailing list posts about that.
Zend Lucene is 100x slower than Java Lucene.
We were running a Mono version of Lucene for a while, weren't we? How did that compare?
- d.
David Gerard wrote:
On 29/02/2008, Domas Mituzas midom.lists@gmail.com wrote:
That gave performance similar to the MySQL fulltext index *BUT* when I queried the same index with Luke (which is Java), the query was *fast*. Sorry, I can't find the mailing list posts about that.
Zend Lucene is 100x slower than Java Lucene.
We were running a Mono version of Lucene for a while, weren't we? How did that compare?
It was moderately slower than the Java, but on the same order of magnitude for most stuff. Performance differences here were mainly about the Mono VM being a bit slower (at least at the time) and in some cases the regex library being much less efficient (index generation).
The reasons for using Mono at the time over Sun Java or GCJ were:
* Sun Java - fast, but not open source enough
* GCJ - fast, open source, but mystery memory leaks
* Mono - a bit slower, open source, no mystery memory leaks
Of course over time, mystery memory leaks crept into the system. ;)
Eventually, Sun Java became more and more open to the point where we don't really care anymore (if we get real pissy about it again we could start running an OpenJDK-based VM such as IcedTea), and the guy who picked up development on our Lucene server again preferred to work with the Java version instead of the C# one. (Among other things, this gives you access to the latest Lucene version instead of an older port.)
There's no real reason to choose Mono for this sort of task if we were starting again. Hypothetically, if we wanted to ship a Lucene-based tool by default, we could attempt to have backends supporting both the PHP Zend Lucene and the Java one... assuming you can get even vaguely useful performance out of the PHP one. :)
-- brion vibber (brion @ wikimedia.org)
wikitech-l@lists.wikimedia.org