On Thu, Feb 28, 2008 at 7:14 AM, Neil Harris <usenet@tonal.clara.co.uk> wrote:
> Interesting. What they seem to be proposing is to store the tags for each article in a plain text field, and then use the built-in MySQL full-text search mechanism to index and search that, thus taking advantage of all the development already devoted to speeding up general-purpose full-text search.
>
> I wonder how it would scale to Wikipedia's vast datasets?
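For concreteness, that scheme amounts to something like the sketch below (the table and column names are made up for illustration):

  -- All of an item's tags packed into one text column, with a
  -- FULLTEXT index over it (FULLTEXT requires MyISAM in MySQL 5.0).
  CREATE TABLE item_tags (
    item_id INT UNSIGNED NOT NULL PRIMARY KEY,
    tags    TEXT NOT NULL,  -- e.g. 'physics quantum mechanics'
    FULLTEXT INDEX ft_tags (tags)
  ) ENGINE=MyISAM;

  -- Tag intersection ('physics' AND 'quantum') in a single indexed
  -- lookup, which is presumably the attraction:
  SELECT item_id
  FROM item_tags
  WHERE MATCH(tags) AGAINST('+physics +quantum' IN BOOLEAN MODE);

One caveat: the fulltext tokenizer splits on whitespace and by default drops words shorter than ft_min_word_len (4), so multi-word or very short tags would need some encoding.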
Well, the point is that exactly that solution has already been discussed extensively on this list, in exactly this context, and the answer is "not clear". Performance in various tests (primarily, if not exclusively, by Aerik) was not terrible, but might or might not be good enough; the thought was to use something like Lucene instead, which would probably be faster. The performance of the normalized schema was also tested here, I think by Greg Maxwell, who concluded that PostgreSQL was much faster (which this paper agrees with).
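The normalized layout those tests compared against looks roughly like this (again with hypothetical names; it is essentially the shape of categorylinks):

  -- One row per tag-item association, indexed both ways.
  CREATE TABLE tag_link (
    item_id INT UNSIGNED NOT NULL,
    tag     VARCHAR(255) NOT NULL,
    PRIMARY KEY (item_id, tag),
    INDEX idx_tag (tag)
  );

  -- The same intersection expressed as a self-join; how the planner
  -- handles this join is where PostgreSQL apparently pulled ahead:
  SELECT a.item_id
  FROM tag_link a
  JOIN tag_link b ON b.item_id = a.item_id
  WHERE a.tag = 'physics' AND b.tag = 'quantum';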
The interesting part is their remark that they basically have no idea why fulltext is actually faster, since in principle it shouldn't be. It makes me wonder whether RDBMSes will put more effort into efficient indexes for this kind of thing in the future.
One thing to keep in mind is that their data set had 50,000 tag-item associations, which is about 500 times smaller than enwiki's categorylinks (putting the latter at roughly 25 million rows). Scalability data with smaller and larger data sets would have made an interesting addition to the paper.