On Fri, Mar 7, 2008 at 5:00 PM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
> Well, unless you have a better idea...?
> Yes, there's fulltext, but is that really the most efficient way? It
> certainly isn't the most elegant ;-)
Manually calculating all intersections beforehand isn't terribly
elegant either. :)
> Huh. Remind me again, why are we stuck at 4.0.26?
Later versions mostly gain new features that contribute nothing to
performance, which is what we care about. In most cases, MySQL
internally rewrites subqueries to joins or worse anyway, so it's not
like you gain much.
Try running EXPLAIN on the query. I'd expect to see a temporary table
created with intermediate results at each step: want to prove me
wrong?
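To make the EXPLAIN suggestion concrete, here's a minimal sketch using Python's bundled sqlite3 rather than MySQL 4.0, so the plan output differs (sqlite prints index lookups where MySQL may materialize temporary tables), but the intersection-as-self-join query is the same shape. The schema is a cut-down version of MediaWiki's categorylinks table; the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    CREATE INDEX cl_to_idx ON categorylinks (cl_to, cl_from);
    INSERT INTO categorylinks VALUES
        (1, 'Men'), (1, 'Living_people'),
        (2, 'Men'),
        (3, 'Living_people');
""")

# A two-way category intersection written as a self-join -- roughly what
# MySQL rewrites an IN-subquery into internally anyway.
query = """
    SELECT a.cl_from
    FROM categorylinks a
    JOIN categorylinks b ON a.cl_from = b.cl_from
    WHERE a.cl_to = 'Men' AND b.cl_to = 'Living_people'
"""
rows = [r[0] for r in conn.execute(query)]  # [1]: only page 1 is in both

# EXPLAIN QUERY PLAN shows how this engine executes the join.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```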
> For more intersections, I see three ways out:
> * Move to MySQL 4.1 :-)
Almost certainly doesn't help, as I said. I would be surprised if
subqueries worked with any kind of remotely acceptable efficiency.
> * Request complete list for every A|B combination,
>   and merge in PHP.
>   Too much memory usage?
And too much network latency, and too much network traffic. You're
talking about megabytes being sent either way in the worst case, with
intersections of hundreds of thousands (consider Men intersect Living
people, with flattened categories so that Men is properly populated).
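For a feel for the numbers, here's a quick back-of-the-envelope in Python; every figure is hypothetical, just in the ballpark of a flattened Men/Living people case.

```python
# Rough worst case for shipping one full category list to PHP and merging
# there. All numbers are illustrative, not measurements.
rows_per_category = 500_000   # e.g. a flattened "Men" category
bytes_per_row = 30            # page title plus protocol overhead, roughly
transfer_mb = rows_per_category * bytes_per_row / 1_000_000  # 15.0 MB

# The client-side merge also needs both full lists in memory at once.
men = set(range(500_000))              # stand-ins for page IDs
living = set(range(250_000, 750_000))
both = men & living                    # the A|B intersection, merged in PHP
```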
> * Store "higher order" hashes as well -
>   A|B|C, A|B|C|D... Waste of diskspace?
The most extreme would be to just store every possible intersection
individually. This isn't quite as insane as it sounds, since almost
all intersections would be empty, and so not stored. It would result
in storage requirements exponential in the number of categories per
article, however: specifically 2^n - 1 rows for an article with n
categories (the number of nonzero bit patterns in n bits, if you
like). For an article with 50 or even
20 categories this would obviously be ludicrous, so some compromise
would be necessary. If you stored up to four or five, it would only
be five or four joins for even a 20-way intersection, but I haven't
calculated the storage requirements of that (got to go in a minute).
Nor am I sure of the efficiency in other ways.
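The storage math above can be sketched out quickly; the figures below just apply the 2^n - 1 formula and the "store up to order k" compromise to a hypothetical 20-category article.

```python
from math import comb, ceil

n = 20  # categories on one article, a plausible worst case from the thread

# Storing every possible intersection: 2^n - 1 rows for this one article.
all_subsets = 2**n - 1  # 1,048,575 rows

# Compromise: store only intersections up to order k,
# i.e. C(n,1) + C(n,2) + ... + C(n,k) rows.
def rows_stored(n, k):
    return sum(comb(n, i) for i in range(1, k + 1))

rows_k4 = rows_stored(n, 4)  # 6,195 rows
rows_k5 = rows_stored(n, 5)  # 21,699 rows

# A 20-way intersection then needs ceil(20/k) stored hashes joined:
joins_k4 = ceil(n / 4)  # five joins with up-to-4-way hashes
joins_k5 = ceil(n / 5)  # four joins with up-to-5-way hashes
```

So the compromise does keep per-article storage in the thousands of rows rather than the millions, at the price of a handful of joins for deep intersections.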
It's not like fulltext is O(1) in the number of search terms, or the
number of categories per page. Maybe I'm totally wrong. But I expect
it to work better, is all. It will certainly handle subtractions,
which are valuable and which I can't see any way to cover in your
scheme. It's also a proven technique: more than one popular software
product uses fulltext for tag intersection, AFAIK.
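The point about subtractions can be illustrated with a toy inverted index, which is essentially what a fulltext engine builds over category "tags". This is made-up data and not MediaWiki code; MySQL's boolean-mode fulltext (+term -term) works along the same lines.

```python
from collections import defaultdict

# Hypothetical pages and their categories.
pages = {
    "Alan Turing": {"Men", "1912 births"},
    "Ada Lovelace": {"Women", "1815 births"},
    "Tim Berners-Lee": {"Men", "Living people"},
}

# Build the inverted index: category -> set of pages.
index = defaultdict(set)
for page, cats in pages.items():
    for cat in cats:
        index[cat].add(page)

def search(required, excluded=()):
    """Pages in all `required` categories and none of the `excluded` ones."""
    result = set.intersection(*(index[c] for c in required))
    for c in excluded:
        result -= index[c]  # subtraction is just set difference on postings
    return result

# Intersection: Men who are Living people.
search(["Men", "Living people"])                 # {"Tim Berners-Lee"}
# Subtraction: Men who are NOT Living people.
search(["Men"], excluded=["Living people"])      # {"Alan Turing"}
```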
> Do you think we can rescue this?
Well, I'm not going to tell you what to spend your time on. It's
always good to have a plurality of ideas. It's not the approach I'd
focus on first, but what do I know?