On Tue, Apr 22, 2008 at 9:39 AM, Roan Kattouw roan.kattouw@home.nl wrote:
MinuteElectron wrote a pretty good implementation of category intersection as an extension [1].
You left out the reference here, but if you're talking about this,
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CategoryIntersect...
then it was written by Magnus. I've already commented on it somewhat.
The only downsides I see are:
- It uses the LinkUpdater to gradually build the categoryintersections
table, but there's no maintenance script to build the entire table at once. I've written one today, but haven't figured out yet how to properly integrate this into an extension (I can't really get the path to the maintenance dir from there; are there any other extensions with CLI scripts around?)
Mostly we just assume that we're in wikiroot/extensions/.
As to the subquery thing, I'll describe how the extension fetches pages that are in categories A, B and C (all three of them). First, it calculates hashes for A|B, A|C and B|C (will be called hashAB, hashBC and hashAC respectively). Then, it queries the categoryintersections table for pages that have all three hashes, as follows:
SELECT ci_page FROM categoryintersections WHERE ci_hash = 'hashAB' AND ci_page IN ( SELECT ci_page FROM categoryintersections WHERE ci_hash = 'hashBC' AND ci_page IN ( SELECT ci_page FROM categoryintersections WHERE ci_hash = 'hashAC' ) )
As I've commented before: this query won't work on MySQL 4, so it can't be in core (unless perhaps disabled by default, or intelligently auto-enabled depending on SQL engine). It will also probably run very poorly on higher versions of MySQL, since MySQL is stupid about rewriting subqueries into joins. This should be written as a join:
SELECT ci_page FROM categoryintersection AS ci1 JOIN categoryintersection AS ci2 ON ci1.ci_page = ci2.ci_page WHERE ci1.ci_hash = 'hashAB' AND ci2.ci_hash = 'hashBC'
Note that you don't need the third table at all; if something is in A intersect B and in B intersect C, it's automatically in A intersect C as well.
In this case it's an extremely fast query, but that's because there are only one or two rows returned for each result. In the worst case of an empty match, or a match with fewer results than the LIMIT, it will have to scan through both intersections in their entirety, which may well be thousands of rows. It's also much less powerful than a Boolean full-text search: it can't handle subtractions and will probably have to be crippled to some small number of intersections.
Also, I'd like to suggest we merge this extension into core (after improving it first), thoughts?
Well, if someone writes an interface for category intersections, it might be reasonable to have multiple backends in core, given that the backend will be flexible anyway. One advantage of Magnus' scheme is that it will work in pretty much any SQL engine with no modification (at least if rewritten to eliminate advanced features like subqueries ;) ). The alternative suggestion of a fulltext engine will work only on MySQL 5, and possibly PostgreSQL (at least with appropriate extra coding).