Re: [Wikitech-l] Category Intersections - some early testing

28 Sep 2006


      On 9/27/06, Stephen Bain stephen.bain@gmail.com wrote:
...
Both Aerik and Greg were running on their own machines with local
copies of the db (or parts thereof). To really get a comparison you'd
have to run tests on the same machine, running the same queries.
Squabbling over favourite software can wait until after they are
compared side by side.
MySQL can not run all of the same queries because it is less powerful
software, which is my point.. it's not a matter of favorites.
I did test MySQL on the pathological case of intersecting two huge
categories. I gave up after it chewed on it for five minutes.
The question is one of computational order.
On MySQL the only obvious and reasonable way to accomplish
intersections is via an inner join.
MySQL will execute this with an index lookup on the first cat match
followed by a loop testing each of the returned el_froms against the
second cat.  This is very slow if the first selected cat is large.
Given the same query innerjoin query Postgresql will separately build
in memory bitmaps for the results of the index lookups and then apply
the intersections to the bitmaps, then read the results out sequent
ally.   This is much faster in the worst case, and similar in the easy
cases.
But postgresql goes further because it also supports both inverted
indexes, and Russian Doll tree indexes for multivalue fields.  Both
these methods turn our intersections into very cheap operations with
something like O(members log members) behavior and a very low
constant.  ... The case of intersecting two huge cats takes just
miliseconds... the only thing kinda slow you can have it do is produce
a hundred thousand results, which causes MVCC visibility to dominate
the query time).
Both inverted index and RD-Tree would have had similar performance
with only a few million rows. I used an inverted index in my example
because it is what I use for analysis work... I often use indexes on
links and images against *all revisions* of all pages. Inverted index
scales better for that, keeping most operations subsecond.
...
I think the question Aerik was posing was what factors make some cat
intersections take longer than others, not what makes it take a
certain number value.
Didn't domas post last time this came up?  As I said above with
*mysql* you can't avoid the loop over the results of half the
lookup... thats just going to be slow sometimes. Hopefully the
database will be smart and loop over the smaller of the two, but I
wouldn't count on it... and the case of two large cats is just going
to stink.
So here is the straight dope:
1) Using joins for this will never be blindingly fast in the worst case.
2) MySQL joins are slower than they could be because MySQL doesn't
support in memory bitmaps yet (although thats coming).
3) Other databases offer alternative indexing schemes which are fast
in all cases.
So long as we are staying on MySQL (which is a good for many reasons,
just not at all good for things which are more complex than simple web
traffic), the only way for us to get acceptable results for this is
going to be to call an outside piece of software exactly as we've done
with Lucine for full text search.
It might be wise of us to instead consider figuring out how to
integrate a link to PostgreSQL instead, which not only can do inverted
index based full text search like Lucine, but can support our need for
things like intersections, and future applications like fast
geospatial queries.  (Yes, mysql can do some of these things a bit
better if you give up reliability and use MyISAM table type rather
than innodb)...

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Category Intersections - some early testing