Re: [Wikitech-l] Tag intersection database performance (Neil Harris)

28 Feb 2008

 Neil Harris wrote:

...

 Magnus Manske wrote:
  I just found

http://point.davidglasser.net/wp-content/uploads/point.davidglasser.net/200…

 and thought I'd share it with the list...

 Magnus

 Interesting. What they seem to be proposing is to store the tags for
 each article in a plain text field, and then use the built-in MySQL
 full-text search mechanism to index and search that, thus taking
 advantage of all the development already devoted to speeding up
 general-purpose full text search.

 I wonder how it would scale to Wikipedia's vast datasets?

 Argh... I tested exactly that question last year:
https://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-December/028081.h…

and then talked about it again here:
https://lists.wikimedia.org/mailman/htdig/wikitech-l/2008-February/036570.h…

I think that the fulltext solution is probably very good for a mid-size
application, but it sounds like (from people who know more about MySQL
databases than I do) it would not stand up to Wikipedia's traffic.  Tim
Starling suggested that Lucene is better at intersections than MySQL's
fulltext database... I also did some testing with a Lucene index, but I
really don't want to set up Java on my server, so I used Zend_Search_Lucene.
That gave performance similar to the MySQL fulltext index *BUT* when I
queried the same index with Luke (which is Java), the query was *fast*.
Sorry, I can't find the mailing list posts about that.

So, I think the solution is to either a) add a field to the current search
index or b) create a new search index.  A fulltext index might make a nice
addition to MediaWiki for smaller installations though (and for folks who
don't want to run Java).
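
For readers who have not seen it, the fulltext approach discussed in this
thread boils down to a boolean-mode MATCH ... AGAINST query in which every
tag is marked as required.  Below is a minimal sketch of how such a query
could be built, not the code used in any of the tests mentioned here.  The
table name page_tags, the columns page_id and tags, and the helper name
build_tag_intersection_query are all hypothetical, and the %s placeholder
assumes a MySQL driver that uses the "format" parameter style.

```python
import re

# Assumption: tags are simple word-like tokens, stored space-separated
# in a FULLTEXT-indexed column (hypothetical schema, see lead-in).
_TAG_RE = re.compile(r"^\w+$")


def build_tag_intersection_query(tags, table="page_tags", column="tags"):
    """Return (sql, params) selecting rows that contain every tag in `tags`."""
    cleaned = []
    for tag in tags:
        # Reject anything that could be read as a boolean-mode operator.
        if not _TAG_RE.match(tag):
            raise ValueError(f"unsupported tag: {tag!r}")
        if tag not in cleaned:
            cleaned.append(tag)
    if not cleaned:
        raise ValueError("at least one tag is required")

    # In MySQL boolean mode a leading '+' marks a word as required,
    # so '+a +b' matches only rows whose text contains both a and b.
    search_expr = " ".join("+" + tag for tag in cleaned)
    sql = (
        f"SELECT page_id FROM {table} "
        f"WHERE MATCH({column}) AGAINST (%s IN BOOLEAN MODE)"
    )
    return sql, (search_expr,)


if __name__ == "__main__":
    sql, params = build_tag_intersection_query(["wiki", "search"])
    print(sql)
    print(params)
```

The search expression is passed as a bound parameter rather than pasted into
the SQL string.  One caveat worth checking before relying on this: MySQL's
fulltext index skips words below a configured minimum length and words on
its stopword list, so very short or very common tags may silently fail to
match unless the server settings are adjusted.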

FYI, I am using a fulltext index for tagging on my social bookmarking
application http://tagthis.info (I know it's not a great social bookmarking
app; the idea is that it's a hosted service where anyone can add tags to
webpages with some JavaScript - I'm beta testing it on my wiki directory)
and at that scale the performance is perfectly adequate.  I'm looking at
CLucene to set up an indexing daemon for my higher-performance searching
needs (which might interest folks on this list?)

Best Regards,
Aerik

-- 
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!
