On Fri, Feb 29, 2008 at 5:28 AM, Magnus Manske magnusmanske@googlemail.com wrote:
I just had the following thought: For a tag intersection system,
- we limit queries to two intersections (show all pages with categories A and B)
- we assume on average 5 categories per page (can someone check that?)
Actually, it seems there are on average only about two categories per page: 12,000,000 page rows and 25,000,000 categorylinks rows. Of course, that doesn't just count articles.
then we have 5*4=20 intersections per page.
You're double-counting. It's only 10 intersections: A|B is the same as B|A. (For the purposes of storage, of course, you'd want to consistently pick one of the two categories to go first in the hashed string, say the ASCIIbetically first. So you'd store B|A as A|B, since A < B. Or whatever.)
Now, for each intersection, we calculate MD5("A|B") to get an integer hash, and store that in a new table (page_id INTEGER, intersection_hash INTEGER). That table would be 4 times as long as the categorylinks table.
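Just to make the mechanics concrete, here's a rough Python sketch of how the canonically ordered pair hash could be computed and how a page's rows would be generated. The function names are mine, and truncating MD5 to 64 bits is just one choice (see the size note below); treat it as an illustration, not a spec.

    import hashlib
    from itertools import combinations

    def intersection_hash(cat_a, cat_b):
        """Hash an unordered category pair to a 64-bit integer.

        The pair is canonicalized (sorted) first, so ("B", "A") and
        ("A", "B") collapse to the same value, as discussed above.
        """
        first, second = sorted((cat_a, cat_b))
        digest = hashlib.md5(f"{first}|{second}".encode("utf-8")).digest()
        # Keep the first 8 bytes of the MD5 as an unsigned 64-bit integer.
        return int.from_bytes(digest[:8], "big")

    def pair_rows(page_id, categories):
        """Yield (page_id, intersection_hash) rows for every unordered
        pair of categories on a page: n*(n-1)/2 rows, not n*(n-1)."""
        for a, b in combinations(sorted(categories), 2):
            yield page_id, intersection_hash(a, b)

    # Example: a page in categories A, B, C produces 3 rows.
    print(list(pair_rows(42, ["B", "A", "C"])))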
It would be smaller, probably. Categorylinks is int, varchar(255), varchar(86), timestamp, which is 4 + ? + ? + 4 bytes, with the ?'s being pretty large. And that's duplicated a few times for various indexes. Even if you go for a 64-bit hash, your table data is still only 12 bytes per row, with the primary key (page_id, intersection_hash) taking no additional storage and a reverse key taking another 24 (or maybe 12? is it that smart?) bytes per row.
For a point of comparison, on my old dump of simple-wiki, categorylinks has 39671 rows, with 3637248 bytes of data and 4980736 bytes of indexes. That makes about 217 bytes per row on average. The table you propose should be only 36 bytes per row, if I'm not wrong, plus I guess some percentage overhead for various things. I think the percentage will be pretty large: an InnoDB table I have that's just (int, int, int) is 53 bytes of data per row, not just 12. But still, your table should be considerably smaller per row than categorylinks.
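(The arithmetic, for anyone checking: 217 is just data plus indexes divided by rows, and 36 is my 12 + 24 guess from above. In Python:)

    rows = 39671
    data_bytes = 3637248
    index_bytes = 4980736

    # Average bytes per row for categorylinks on that dump, data + indexes.
    print((data_bytes + index_bytes) / rows)   # ~217.2

    # Rough estimate for the proposed table: 4-byte page_id + 8-byte hash,
    # plus ~24 bytes for a reverse key (or only 12, if InnoDB is smart).
    print(12 + 24)                             # 36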
Of course, the number of rows used is not going to be proportional to the average number of categories per page, but roughly to the average of the square of the number of categories per page. Specifically, it will be equal to the sum over all pages of n(n-1)/2 (where n is the number of categories on that page), so pages with one category contribute nothing, but those with 20 categories each contribute an extra 190 entries. So whether it ends up with more or fewer than four times as many rows as categorylinks depends on the distribution of categories. Still, it should be not much larger than categorylinks, at worst.
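Here's a toy illustration of how much the distribution matters; both made-up distributions below have the same average of 2 categories per page, but the pair table comes out very differently:

    def pair_table_rows(cats_per_page):
        """Total rows in the pair table: sum over pages of n*(n-1)/2."""
        return sum(n * (n - 1) // 2 for n in cats_per_page)

    def categorylinks_rows(cats_per_page):
        """Total rows in categorylinks: sum over pages of n."""
        return sum(cats_per_page)

    uniform = [2] * 1000               # every page has exactly 2 categories
    skewed  = [1] * 900 + [11] * 100   # most pages have 1, a few have 11

    for dist in (uniform, skewed):
        print(categorylinks_rows(dist), pair_table_rows(dist))
    # uniform: 2000 categorylinks rows -> 1000 pair rows
    # skewed:  2000 categorylinks rows -> 5500 pair rows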
The idea is interesting, and might have merit. Certainly simple lookups would be very fast, even lookups of three or perhaps four categories. It's not quite as flexible as fulltext search (where you can do arbitrarily large intersections, and subtractions too), but it may well be faster. I don't *see* any pitfalls.
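To sketch what a lookup would look like, here's a toy in-memory version in Python (not MediaWiki code; in practice these would be queries against the new table, but the shape is the same): a two-way intersection is a single hash lookup, and a three-way intersection is two pair lookups set-intersected.

    import hashlib
    from collections import defaultdict
    from itertools import combinations

    def intersection_hash(a, b):
        """64-bit hash of the canonically ordered pair, as in the sketch above."""
        first, second = sorted((a, b))
        return int.from_bytes(hashlib.md5(f"{first}|{second}".encode()).digest()[:8], "big")

    # Toy "pair table": intersection_hash -> set of page_ids (stand-in for the SQL table).
    pair_table = defaultdict(set)
    pages = {1: ["A", "B", "C"], 2: ["A", "B"], 3: ["B", "C"]}
    for page_id, cats in pages.items():
        for x, y in combinations(cats, 2):
            pair_table[intersection_hash(x, y)].add(page_id)

    def intersect2(a, b):
        """Pages in both a and b: a single hash lookup."""
        return pair_table[intersection_hash(a, b)]

    def intersect3(a, b, c):
        """Pages in a, b and c: intersect the A|B result with the B|C result."""
        return intersect2(a, b) & intersect2(b, c)

    print(intersect2("A", "B"))       # {1, 2}
    print(intersect3("A", "B", "C"))  # {1}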