On Fri, Feb 29, 2008 at 5:28 AM, Magnus Manske wrote:
> I just had the following thought: For a tag
> * we limit queries to two intersections (show all pages with categories A and B)
> * we assume on average 5 categories per page (can someone check that?)
Actually, there are on average only 2 categories per page, it seems:
12,000,000 page rows and 25,000,000 category rows. Of course, that
doesn't just count articles.
> then we have 5*4=20 intersections per page.
You're double-counting. It's only 10 intersections: A|B is the same
as B|A. (For the purposes of storage, of course, you'd want to
consistently pick one of the two categories to go first in the hashed
string, say the ASCIIbetically first. So you'd store B|A as A|B,
since A < B. Or whatever.)
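A minimal sketch of that canonical pair hashing (the function name and the 64-bit truncation are my choices for illustration, not anything from the MediaWiki schema):

```python
import hashlib

def intersection_hash(cat_a, cat_b):
    """Map an unordered category pair to a fixed-width integer key.

    Sort the pair first so that ("B", "A") and ("A", "B") produce the
    same key, then MD5 the joined string and truncate the digest to
    64 bits, as suggested in the message.
    """
    first, second = sorted([cat_a, cat_b])
    digest = hashlib.md5(f"{first}|{second}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # 64-bit hash

# B|A canonicalizes to A|B, so the two orderings collide on purpose.
assert intersection_hash("B", "A") == intersection_hash("A", "B")
```

A lookup for "pages in both A and B" then becomes a single equality match on intersection_hash, which is what makes the simple case fast.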
> Now, for each intersection, we calculate
> MD5("A|B") to get an integer
> hash, and store that in a new table (page_id INTEGER,intersection_hash
> That table would be 4 times as long as the categorylinks table.
It would be smaller, probably. Categorylinks is int, varchar(255),
varchar(86), timestamp, which is 4 + ? + ? + 4 bytes, with the ?'s
being pretty large. And that's duplicated a few times for various
indexes. Even if you go for a 64-bit hash, your table data is still
only 12 bytes per row, with the primary key (page_id,
intersection_hash) taking no additional storage and a reverse key
taking another 24 (or maybe 12? is it that smart?) bytes per row.
For a point of comparison, on my old dump of simple-wiki,
categorylinks has 39,671 rows, with 3,637,248 bytes of data and
4,980,736 bytes of indexes. That makes 217 bytes per row average. The table
you propose should be only 36 bytes per row, if I'm not wrong, plus I
guess some percentage overhead for various things. I think the
percentage will be pretty large: an InnoDB table I have that's just
(int, int, int) is 53 bytes of data per row, not just 12. But still,
your table should be considerably smaller per row than categorylinks.
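The arithmetic above can be checked directly; the figures are the ones quoted from the simple-wiki dump, nothing here is measured fresh:

```python
# Quoted figures for simple-wiki's categorylinks table.
rows = 39671
data_bytes = 3637248
index_bytes = 4980736

bytes_per_row = (data_bytes + index_bytes) / rows
assert round(bytes_per_row) == 217  # ~217 bytes/row, data + indexes

# Proposed table, per row: 4-byte page_id + 8-byte 64-bit hash = 12
# bytes of data, plus roughly 24 bytes for a reverse index entry.
proposed_per_row = (4 + 8) + 24
assert proposed_per_row == 36
```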
Of course, the number of rows used is not going to be proportional to
the average number of categories per page, but to the average of the
square of that number. Specifically, it will equal the sum over all
pages of n(n-1)/2 (where n is the number of categories on that page),
so a page with one category contributes nothing, but one with 20
categories will give you an extra 190 entries. So whether it
will be more or fewer than four times the rows depends on the
distribution of categories. Still, it should be not much larger than
categorylinks, at worst.
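That per-page sum can be sketched in a couple of lines (a toy calculation over a list of per-page category counts, not anything run against a real dump):

```python
def intersection_rows(category_counts):
    """Total (page_id, intersection_hash) rows needed, given the
    number of categories on each page: sum of n*(n-1)/2 over pages."""
    return sum(n * (n - 1) // 2 for n in category_counts)

# One category contributes nothing; 20 categories contribute 190 rows.
assert intersection_rows([1]) == 0
assert intersection_rows([20]) == 190
```

So the average of n matters less than how heavy the tail of the distribution is.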
The idea is interesting, and might have merit. Certainly simple
lookups would be very fast, even lookups of three or perhaps four
categories. It's not quite as flexible as fulltext search (where you
can do arbitrarily large intersections, and subtractions too), but it
could be that it's faster. I don't *see* any pitfalls.