On Tue, Apr 22, 2008 at 9:39 AM, Roan Kattouw <roan.kattouw(a)home.nl> wrote:
MinuteElectron wrote a pretty good implementation of
category
intersection as an extension [1].
You left out the reference here, but if you're talking about this,
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CategoryIntersec…
then it was written by Magnus. I've already commented on it somewhat.
The only downsides I see are:
* It uses the LinkUpdater to gradually build the categoryintersections
table, but there's no maintenance script to build the entire table at
once. I've written one today, but haven't figured out yet how to
properly integrate this into an extension (I can't really get the path
to the maintenance dir from there; are there any other extensions with
CLI scripts around?)
Mostly we just assume that we're in wikiroot/extensions/.
As to the subquery thing, I'll describe how the
extension fetches pages
that are in categories A, B and C (all three of them). First, it
calculates hashes for A|B, A|C and B|C (will be called hashAB, hashBC
and hashAC respectively). Then, it queries the categoryintersections
table for pages that have all three hashes, as follows:
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashAB' AND ci_page IN (
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashBC' AND ci_page IN (
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashAC'
)
)
As I've commented before: this query won't work on MySQL 4, so it
can't be in core (unless perhaps disabled by default, or intelligently
auto-enabled depending on SQL engine). It will also probably run very
poorly on higher versions of MySQL, since MySQL is stupid about
rewriting subqueries into joins. This should be written as a join:
SELECT ci_page FROM categoryintersection AS ci1
JOIN categoryintersection AS ci2 ON ci1.ci_page = ci2.ci_page
WHERE ci1.ci_hash = 'hashAB' AND ci2.ci_hash = 'hashBC'
Note that you don't need the third table at all; if something is in A
intersect B and in B intersect C, it's automatically in A intersect C
as well.
In this case it's an extremely fast query, but that's because there
are only one or two rows returned for each result. In the worst case
of an empty match, or a match with fewer results than the LIMIT, it
will have to scan through both intersections in their entirety, which
may well be thousands of rows. It's also much less powerful than a
Boolean full-text search: it can't handle subtractions and will
probably have to be crippled to some small number of intersections.
Also, I'd like to suggest we merge this extension
into core
(after improving it first), thoughts?
Well, if someone writes an interface for category intersections, it
might be reasonable to have multiple backends in core, given that the
backend will be flexible anyway. One advantage of Magnus' scheme is
that it will work in pretty much any SQL engine with no modification
(at least if rewritten to eliminate advanced features like subqueries
;) ). The alternative suggestion of a fulltext engine will work only
on MySQL 5, and possibly PostgreSQL (at least with appropriate extra
coding).