On Fri, Feb 29, 2008 at 6:35 PM, Steve Sanbeg <ssanbeg@ask.com> wrote:
> That was my thinking; that categories without a page ID are probably typos, and anyway less useful for intersection; if not, the articles could be added. So using the IDs and some recursion could be simpler and more scalable than using a hash.
I don't really see that it's particularly simpler or more scalable. It does have moderately better locality of reference, although it's still not great. Because of the denormalization, the half (on average) of a category's entries where it's in the second position are scattered across the entire table, and only the half where it's first are clustered in the same pages. I don't know of a very good reason to prefer either approach.
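To make the locality point concrete, here is a rough sketch in Python. It assumes (my guess at the layout, not something settled in this thread) that each row is a (lower category ID, higher category ID, page ID) triple and that the table is clustered on that key, which the sort stands in for. The names build_pair_rows, positions, and is_contiguous, and the sample IDs, are made up for illustration.

```python
from itertools import combinations


def build_pair_rows(page_categories):
    """Return sorted (cat_lo, cat_hi, page_id) rows, one per category pair per page."""
    rows = []
    for page_id, cats in page_categories.items():
        # Assumption: the lower category ID always goes in the first column.
        for lo, hi in combinations(sorted(set(cats)), 2):
            rows.append((lo, hi, page_id))
    rows.sort()  # stands in for clustering on a (cat_lo, cat_hi, page_id) key
    return rows


def positions(rows, cat):
    """Row offsets where cat appears in the first and in the second column."""
    first = [i for i, row in enumerate(rows) if row[0] == cat]
    second = [i for i, row in enumerate(rows) if row[1] == cat]
    return first, second


def is_contiguous(offsets):
    """True if the sorted offsets form one unbroken run (or are empty)."""
    return not offsets or offsets[-1] - offsets[0] + 1 == len(offsets)


# Hypothetical data: page ID -> category IDs.
page_categories = {
    1: [10, 20, 30],
    2: [10, 20],
    3: [20, 30],
    4: [5, 20, 30],
    5: [20, 40],
}

rows = build_pair_rows(page_categories)
first, second = positions(rows, 20)
print("category 20 first at", first, "contiguous:", is_contiguous(first))
print("category 20 second at", second, "contiguous:", is_contiguous(second))
```

With this sample data, the rows where category 20 comes first form one unbroken run, while the rows where it comes second are interleaved with rows for other pairs, which is the scattering I mean.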
Domas, do you have any thoughts on this scheme?