New subject: Category Intersections

15 Sep 2006


      Simetrical and Timwi wrote:
...
...
Besides, I think you're all forgetting that if we have a table that
stores, say, all two-way category intersections, we can actually get rid
of the categorylinks table itself -- it would be contained within that
new table and would be wholly redundant. Similarly, a table with all
three-way intersections contains in it all two-way intersections as well.
Yeah, but then surely you'd have to take the union of a potentially
large number of tables to display a single-category view, which I
suspect is going to remain a more common request than a
category-intersection view.  Isn't that going to give you a
substantial performance hit for large categories?
(And incidentally, pages that are only in a single category will have
to remain in their own table.  You can't get that from intersection
tables.)
Whoa - intersection "tables"?  Why more than one?  You lost me.  Also,
I figured you could get higher order intersections from lower ones,
but how do you go the other way?  And, I second the question about the
performance hit, even if you can.
I was picturing adding the intersections to the existing table - like
this, suppose you have an article in "Living people", "People from
Maine", "American horror writers" - it would also get the category
intersections of "Living people ~ People from Maine", "People from
Maine ~ American horror writers", etc.  You would not need "American
horror writers ~ People from Maine" because it is the same
intersection.  So I can see how you could intersect "Living people"
and  "People from Maine ~ American horror writers" and get the
intersection of all three - but you would still need to store "Living
people ~ People from Maine" AND "People from Maine ~ American horror
writers" - it doesn't save you from having to store every combination
of 2 categories.
The plus side is this lets you calculate the intersection of 4
categories dynamically, or any 3, or any 2, with only storing the a
binomial coefficient (n choose k, where n is the number of categories
and k=2) additional records, and the retrieval of articles at the
intersection of only two categories will be fast.  The logic for what
to retrievee and store gets a little convoluted, but not too bad... is
the performance improvement worth it?
I don't know enough about SQL to talk intelligently about the speed of
retrieving one record in two thousand vs. two million, but I agree,
some testing is called for here.  If someone can point me where I can
download the categorylinks table (*only* preferably) I'll get it and
throw some sql at it.  It'll take me a couple of days to get around
to.
Best Regards,
Aerik

Re: [Wikitech-l] Category Intersections