On Thu, Feb 28, 2008 at 7:14 AM, Neil Harris <usenet@tonal.clara.co.uk> wrote:
> Interesting. What they seem to be proposing is to store the tags for each article in a plain text field, and then use the built-in MySQL full-text search mechanism to index and search that, thus taking advantage of all the development already devoted to speeding up general-purpose full-text search.
>
> I wonder how it would scale to Wikipedia's vast datasets?
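For concreteness, that scheme amounts to something like the sketch below (the table and column names are made up for illustration):

  -- All of an item's tags packed into one text column, with a
  -- FULLTEXT index over it (FULLTEXT requires MyISAM in MySQL 5.0).
  CREATE TABLE item_tags (
    item_id INT UNSIGNED NOT NULL PRIMARY KEY,
    tags    TEXT NOT NULL,  -- e.g. 'physics quantum mechanics'
    FULLTEXT INDEX ft_tags (tags)
  ) ENGINE=MyISAM;

  -- Tag intersection ('physics' AND 'quantum') in a single indexed
  -- lookup, which is presumably the attraction:
  SELECT item_id
  FROM item_tags
  WHERE MATCH(tags) AGAINST('+physics +quantum' IN BOOLEAN MODE);

One caveat: the fulltext tokenizer splits on whitespace and by default drops words shorter than ft_min_word_len (4), so multi-word or very short tags would need some encoding.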
Well, the point is that exactly that solution has already been discussed extensively on this list, in exactly this context, and the answer is "not clear". Performance in various tests (primarily, if not exclusively, by Aerik) was not terrible, but might or might not be good enough; the thought was to use something like Lucene instead, which would probably be faster. The performance of the normalized schema was also tested here, I think by Greg Maxwell, who concluded that PostgreSQL was much faster (which this paper agrees with).
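The normalized layout those tests compared against looks roughly like this (again with hypothetical names; it is essentially the shape of categorylinks):

  -- One row per tag-item association, indexed both ways.
  CREATE TABLE tag_link (
    item_id INT UNSIGNED NOT NULL,
    tag     VARCHAR(255) NOT NULL,
    PRIMARY KEY (item_id, tag),
    INDEX idx_tag (tag)
  );

  -- The same intersection expressed as a self-join; how the planner
  -- handles this join is where PostgreSQL apparently pulled ahead:
  SELECT a.item_id
  FROM tag_link a
  JOIN tag_link b ON b.item_id = a.item_id
  WHERE a.tag = 'physics' AND b.tag = 'quantum';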
The interesting part is their remark that they basically have no idea why fulltext is actually faster, since in principle it shouldn't be. It makes me wonder whether RDBMSes will put more effort into efficient indexes for this kind of thing in the future.
One thing to keep in mind is that their data set had 50,000 tag-item associations, which is about 500 times smaller than enwiki's categorylinks (putting the latter at roughly 25 million rows). Scalability data with smaller and larger data sets would have made an interesting addition to the paper.