On Fri, Mar 7, 2008 at 5:00 PM, Magnus Manske <magnusmanske@googlemail.com> wrote:
> Well, unless you have a better idea...? Yes, there's fulltext, but is that really the most efficient way? It certainly isn't the most elegant ;-)
Manually calculating all intersections beforehand isn't terribly elegant either. :)
> Huh. Remind me again, why are we stuck at 4.0.26?
Later versions mostly gain new features that contribute nothing to performance, which is what we care about. In most cases, MySQL internally rewrites subqueries to joins or worse anyway, so it's not like you gain much.
Try running EXPLAIN on the query. I'd expect to see a temporary table created with intermediate results at each step; want to prove me wrong?
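For the record, here's the kind of check I mean, sketched against SQLite since that's easy to run anywhere. The schema is a hypothetical, simplified stand-in for a categorylinks-style table, and SQLite's EXPLAIN QUERY PLAN output differs from MySQL 4.0's EXPLAIN, so treat this purely as an illustration of inspecting the plan for a join-based two-category intersection:

```python
# Illustration only: simplified, hypothetical categorylinks-style schema;
# SQLite's planner and EXPLAIN output differ from MySQL 4.0's.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    CREATE INDEX cl_to_from ON categorylinks (cl_to, cl_from);
""")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [(1, "Men"), (1, "Living_people"), (2, "Men"), (3, "Living_people")],
)

# The usual self-join formulation of a two-way category intersection.
query = """
    SELECT a.cl_from
    FROM categorylinks AS a
    JOIN categorylinks AS b ON a.cl_from = b.cl_from
    WHERE a.cl_to = 'Men' AND b.cl_to = 'Living_people'
"""
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

print(sorted(r[0] for r in conn.execute(query)))  # pages in both categories
```

Each additional category in the intersection adds another self-join, which is where the per-step intermediate results come from.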
> For more intersections, I see three ways out:
> - Move to MySQL 4.1 :-)
Almost certainly doesn't help, as I said. I would be surprised if subqueries worked with any kind of remotely acceptable efficiency.
> - Request complete list for every A|B combination, and merge in PHP.
> Too much memory usage?
And too much network latency, and too much network traffic. You're talking about megabytes being sent either way in the worst case, with intersections of hundreds of thousands of pages (consider Men intersect Living people, with flattened categories so that Men is properly populated).
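To make the objection concrete, here's what "merge in PHP" amounts to, sketched in Python for brevity with made-up ID lists: both complete lists cross the wire before a single result row is produced.

```python
# Sketch of the merge-client-side option, with invented data: fetch the
# complete sorted page-ID list for each category, then intersect locally.
# The point is that both full lists must be transferred and held in memory.

def intersect_sorted(a, b):
    """Linear-time merge of two sorted ID lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

men = list(range(0, 1_000_000, 2))     # pretend result for 'Men'
living = list(range(0, 1_000_000, 3))  # pretend result for 'Living people'
common = intersect_sorted(men, living)
print(len(men), len(living), len(common))  # 500000 333334 166667
```

Half a million IDs on one side alone, all shipped to the client just to throw most of them away after the merge.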
> - Store "higher order" hashes as well - A|B|C, A|B|C|D... Waste of diskspace?
The most extreme would be to just store every possible intersection individually. This isn't quite as insane as it sounds, since almost all intersections would be empty, and so not stored. It would result in exponential storage requirements in the number of categories per article, however: specifically 2^n - 1 rows for an article in n categories (the number of nonzero bit patterns in n bits, if you like). For an article with 50 or even 20 categories this would obviously be ludicrous, so some compromise would be necessary. If you stored up to four or five, it would only be five or four joins for even a 20-way intersection, but I haven't calculated the storage requirements of that (got to go in a minute). Nor am I sure of the efficiency in other ways.
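The row-count arithmetic for that worst case is easy to spell out, though. Storing every non-empty intersection of an article's n categories costs up to 2^n - 1 rows; capping the stored tuples at k-way intersections costs the sum of C(n, i) for i = 1..k:

```python
# Row counts per article for the store-precomputed-intersections scheme.
from math import comb

def rows_all(n):
    """All nonempty subsets of n categories: 2^n - 1."""
    return 2 ** n - 1

def rows_capped(n, k):
    """Subsets of size 1..k only: sum of C(n, i)."""
    return sum(comb(n, i) for i in range(1, k + 1))

print(rows_all(20))        # 1048575 rows for one 20-category article
print(rows_capped(20, 4))  # 6195 rows storing up to 4-way tuples
print(rows_capped(20, 5))  # 21699 rows storing up to 5-way tuples
```

So capping at four- or five-way tuples pulls a 20-category article down from a million-odd rows to a few thousand or a few tens of thousands, which is large but no longer absurd.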
It's not like fulltext is O(1) in the number of search terms, or the number of categories per page. Maybe I'm totally wrong. But I expect it to work better, is all. It will certainly handle subtractions, which are valuable and which, as far as I can tell, your scheme can't conceivably cover. It's also a proven technique used by more than one popular software product for tag intersection, AFAIK.
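Here's the shape of the fulltext approach, using SQLite's FTS5 as a stand-in (assuming your SQLite build includes it) since it's trivial to run: each page's categories become one space-separated "document", and the engine handles both intersection and subtraction in one query. The MySQL analogue would be boolean-mode fulltext, something like MATCH(cats) AGAINST ('+men +living -actors' IN BOOLEAN MODE).

```python
# Sketch of tag intersection with subtraction via a fulltext index.
# SQLite FTS5 stands in for MySQL boolean-mode fulltext here; invented
# single-word category names keep the tokenizer out of the way.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE page_cats USING fts5(cats)")
conn.executemany(
    "INSERT INTO page_cats (rowid, cats) VALUES (?, ?)",
    [
        (1, "men living actors"),
        (2, "men living"),
        (3, "living"),
    ],
)
# Pages in Men AND Living, minus Actors -- the subtraction case.
hits = [r[0] for r in conn.execute(
    "SELECT rowid FROM page_cats "
    "WHERE page_cats MATCH '(men AND living) NOT actors'"
)]
print(hits)  # [2]
```

The subtraction falls out of the query syntax for free, which is exactly the part the precomputed-intersection scheme can't express.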
> Do you think we can rescue this?
Well, I'm not going to tell you what to spend your time on. It's always good to have a plurality of ideas. It's not the approach I'd focus on first, but what do I know?