Simetrical and Timwi wrote:
Besides, I think you're all forgetting that if we have a table that stores, say, all two-way category intersections, we can actually get rid of the categorylinks table itself -- it would be contained within that new table and would be wholly redundant. Similarly, a table with all three-way intersections contains in it all two-way intersections as well.
Yeah, but then surely you'd have to take the union of a potentially large number of tables to display a single-category view, which I suspect is going to remain a more common request than a category-intersection view. Isn't that going to give you a substantial performance hit for large categories?
(And incidentally, pages that are only in a single category will have to remain in their own table. You can't get that from intersection tables.)
Whoa - intersection "tables"? Why more than one? You lost me. Also, I figured you could get higher order intersections from lower ones, but how do you go the other way? And, I second the question about the performance hit, even if you can.
I was picturing adding the intersections to the existing table - like this, suppose you have an article in "Living people", "People from Maine", "American horror writers" - it would also get the category intersections of "Living people ~ People from Maine", "People from Maine ~ American horror writers", etc. You would not need "American horror writers ~ People from Maine" because it is the same intersection. So I can see how you could intersect "Living people" and "People from Maine ~ American horror writers" and get the intersection of all three - but you would still need to store "Living people ~ People from Maine" AND "People from Maine ~ American horror writers" - it doesn't save you from having to store every combination of 2 categories.
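To make the "same intersection" point concrete, here is a minimal sketch of how the pair records for one article might be generated. The function name `pair_keys` and the choice to sort the two category names alphabetically are assumptions made for illustration; any fixed ordering would do, as long as "A ~ B" and "B ~ A" collapse to one key.

```python
from itertools import combinations

def pair_keys(categories):
    """Return one canonical key per unordered pair of categories.

    Hypothetical helper: sorting the names first means each pair is
    emitted in exactly one order, so the reversed pair never appears.
    """
    unique_sorted = sorted(set(categories))
    return [f"{a} ~ {b}" for a, b in combinations(unique_sorted, 2)]

keys = pair_keys(["Living people", "People from Maine", "American horror writers"])
for key in keys:
    print(key)
```

With three categories this yields three pair records, and every one of them still has to be stored.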
The plus side is that this lets you calculate the intersection of 4 categories dynamically, or any 3, or any 2, while storing only a binomial coefficient (n choose k, where n is the number of categories and k=2) of additional records, and the retrieval of articles at the intersection of only two categories will be fast. The logic for what to retrieve and store gets a little convoluted, but not too bad... is the performance improvement worth it?
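A rough sketch of that trade-off, assuming an in-memory dict stands in for the table. The article names, `build_pair_index`, and `intersect` are all hypothetical. An article in n categories contributes n choose 2 pair records, and a higher-order intersection is computed by chaining stored pairs that together cover every requested category.

```python
from itertools import combinations
from math import comb

def build_pair_index(article_categories):
    """Map each sorted (cat_a, cat_b) pair to the set of articles in both."""
    index = {}
    for article, cats in article_categories.items():
        for pair in combinations(sorted(set(cats)), 2):
            index.setdefault(pair, set()).add(article)
    return index

def intersect(index, categories):
    """Intersect two or more categories using only stored pair records."""
    cats = sorted(set(categories))
    if len(cats) < 2:
        raise ValueError("need at least two distinct categories")
    result = None
    # Consecutive pairs (c0,c1), (c1,c2), ... mention every category once.
    for pair in zip(cats, cats[1:]):
        members = index.get(pair, set())
        result = set(members) if result is None else result & members
    return result

# Made-up sample data, for illustration only.
articles = {
    "Article A": ["Living people", "People from Maine", "American horror writers"],
    "Article B": ["Living people", "People from Maine"],
    "Article C": ["Living people", "American horror writers"],
}
index = build_pair_index(articles)
extra_records = sum(comb(len(set(cats)), 2) for cats in articles.values())
all_three = intersect(
    index, ["Living people", "People from Maine", "American horror writers"]
)
```

A two-category request is a single lookup here; a k-category request costs k-1 lookups plus set intersections, which is the part that would need testing against real data.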
I don't know enough about SQL to talk intelligently about the speed of retrieving one record in two thousand vs. two million, but I agree, some testing is called for here. If someone can point me where I can download the categorylinks table (*only* preferably) I'll get it and throw some sql at it. It'll take me a couple of days to get around to.
Best Regards, Aerik
On 9/14/06, Aerik Sylvan aerik@thesylvans.com wrote:
Whoa - intersection "tables"? Why more than one? You lost me.
Okay: a table.
Also, I figured you could get higher order intersections from lower ones, but how do you go the other way?
Assume you have a table of all two-category intersections. If you want to get a list of articles containing A, take the union of all intersections of A with any other category. That will only omit articles whose only category is A.
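As a small sketch of that union, with made-up page and category names and hypothetical helper names: the single-category listing is rebuilt from the pair records, then compared against the original links to show which pages drop out.

```python
from itertools import combinations

# Made-up stand-in for categorylinks: page -> set of categories.
categorylinks = {
    "page1": {"A", "B", "C"},
    "page2": {"A", "C"},
    "page3": {"A"},  # only in one category
}

def build_pair_index(links):
    """Map each sorted category pair to the pages that are in both."""
    index = {}
    for page, cats in links.items():
        for pair in combinations(sorted(cats), 2):
            index.setdefault(pair, set()).add(page)
    return index

def members_via_pairs(pair_index, category):
    """Union of every stored intersection that involves `category`."""
    found = set()
    for pair, pages in pair_index.items():
        if category in pair:
            found |= pages
    return found

pair_index = build_pair_index(categorylinks)
via_pairs = members_via_pairs(pair_index, "A")
direct = {page for page, cats in categorylinks.items() if "A" in cats}
missing = direct - via_pairs  # pages whose only category is A
```

The loop touches every pair involving the category, so for a large category the union could span a great many records, which is the performance concern raised earlier in the thread.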
If someone can point me where I can download the categorylinks table (*only* preferably) I'll get it and throw some sql at it. It'll take me a couple of days to get around to.
You want http://download.wikimedia.org/enwiki/20060911/enwiki-20060911-categorylinks.... (108 MB gzipped). download.wikimedia.org for all your data dump needs. :)
wikitech-l@lists.wikimedia.org