Simetrical schreef:
The only downsides I see are:
* It uses the LinkUpdater to gradually build the categoryintersections
table, but there's no maintenance script to build the entire table at
once. I've written one today, but haven't figured out yet how to
properly integrate this into an extension (I can't really get the path
to the maintenance dir from there; are there any other extensions with
CLI scripts around?)
Mostly we just assume that we're in wikiroot/extensions/.
I'll do that then, but it's still somewhat creepy.
As to the subquery thing, I'll describe how
the extension fetches pages
that are in categories A, B and C (all three of them). First, it
calculates hashes for A|B, A|C and B|C (will be called hashAB, hashBC
and hashAC respectively). Then, it queries the categoryintersections
table for pages that have all three hashes, as follows:
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashAB' AND ci_page IN (
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashBC' AND ci_page IN (
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashAC'
)
)
As I've commented before: this query won't work on MySQL 4, so it
can't be in core (unless perhaps disabled by default, or intelligently
auto-enabled depending on SQL engine). It will also probably run very
poorly on higher versions of MySQL, since MySQL is stupid about
rewriting subqueries into joins. This should be written as a join:
SELECT ci_page FROM categoryintersection AS ci1
JOIN categoryintersection AS ci2 ON ci1.ci_page = ci2.ci_page
WHERE ci1.ci_hash = 'hashAB' AND ci2.ci_hash = 'hashBC'
Didn't think of that.
Note that you don't need the third table at all;
if something is in A
intersect B and in B intersect C, it's automatically in A intersect C
as well.
We probably need to hash generation functions then: one that generates
all hashes corresponding to a certain page (AB, AC and BC), and one
which generates hashes for a query (AB and BC only in this case).
In this case it's an extremely fast query, but
that's because there
are only one or two rows returned for each result. In the worst case
of an empty match, or a match with fewer results than the LIMIT, it
will have to scan through both intersections in their entirety, which
may well be thousands of rows. It's also much less powerful than a
Boolean full-text search: it can't handle subtractions and will
probably have to be crippled to some small number of intersections.
Also, I'd like to suggest we merge this
extension into core
(after improving it first), thoughts?
Well, if someone writes an interface for category intersections, it
might be reasonable to have multiple backends in core, given that the
backend will be flexible anyway. One advantage of Magnus' scheme is
that it will work in pretty much any SQL engine with no modification
(at least if rewritten to eliminate advanced features like subqueries
;) ). The alternative suggestion of a fulltext engine will work only
on MySQL 5, and possibly PostgreSQL (at least with appropriate extra
coding).
I missed the explanation of the fulltext implementation. Something like
'Foo With_spaces Bar' and then do a fulltext search for the cats you
need? That would be more powerful, and would probably be faster for
complex intersections. I'll write an alternative to
CategoryIntersections that uses the fulltext schema and run some
benchmarks. I expect to have some results by the end of the week.
Roan Kattouw (Catrope)