On 9/27/06, Stephen Bain stephen.bain@gmail.com wrote:
Both Aerik and Greg were running on their own machines with local copies of the db (or parts thereof). To really get a comparison you'd have to run tests on the same machine, running the same queries. Squabbling over favourite software can wait until after they are compared side by side.
MySQL can not run all of the same queries because it is less powerful software, which is my point.. it's not a matter of favorites.
I did test MySQL on the pathological case of intersecting two huge categories. I gave up after it chewed on it for five minutes.
The question is one of computational order.
On MySQL the only obvious and reasonable way to accomplish intersections is via an inner join.
MySQL will execute this with an index lookup on the first cat match followed by a loop testing each of the returned el_froms against the second cat. This is very slow if the first selected cat is large.
Given the same query innerjoin query Postgresql will separately build in memory bitmaps for the results of the index lookups and then apply the intersections to the bitmaps, then read the results out sequent ally. This is much faster in the worst case, and similar in the easy cases.
But postgresql goes further because it also supports both inverted indexes, and Russian Doll tree indexes for multivalue fields. Both these methods turn our intersections into very cheap operations with something like O(members log members) behavior and a very low constant. ... The case of intersecting two huge cats takes just miliseconds... the only thing kinda slow you can have it do is produce a hundred thousand results, which causes MVCC visibility to dominate the query time).
Both inverted index and RD-Tree would have had similar performance with only a few million rows. I used an inverted index in my example because it is what I use for analysis work... I often use indexes on links and images against *all revisions* of all pages. Inverted index scales better for that, keeping most operations subsecond.
I think the question Aerik was posing was what factors make some cat intersections take longer than others, not what makes it take a certain number value.
Didn't domas post last time this came up? As I said above with *mysql* you can't avoid the loop over the results of half the lookup... thats just going to be slow sometimes. Hopefully the database will be smart and loop over the smaller of the two, but I wouldn't count on it... and the case of two large cats is just going to stink.
So here is the straight dope:
1) Using joins for this will never be blindingly fast in the worst case. 2) MySQL joins are slower than they could be because MySQL doesn't support in memory bitmaps yet (although thats coming). 3) Other databases offer alternative indexing schemes which are fast in all cases.
So long as we are staying on MySQL (which is a good for many reasons, just not at all good for things which are more complex than simple web traffic), the only way for us to get acceptable results for this is going to be to call an outside piece of software exactly as we've done with Lucine for full text search.
It might be wise of us to instead consider figuring out how to integrate a link to PostgreSQL instead, which not only can do inverted index based full text search like Lucine, but can support our need for things like intersections, and future applications like fast geospatial queries. (Yes, mysql can do some of these things a bit better if you give up reliability and use MyISAM table type rather than innodb)...