Simetrical schreef:
The only downsides I see are:
- It uses the LinkUpdater to gradually build the categoryintersections
table, but there's no maintenance script to build the entire table at once. I've written one today, but haven't figured out yet how to properly integrate this into an extension (I can't really get the path to the maintenance dir from there; are there any other extensions with CLI scripts around?)
Mostly we just assume that we're in wikiroot/extensions/.
I'll do that then, but it's still somewhat creepy.
As to the subquery thing, I'll describe how the extension fetches pages that are in categories A, B and C (all three of them). First, it calculates hashes for A|B, A|C and B|C (will be called hashAB, hashBC and hashAC respectively). Then, it queries the categoryintersections table for pages that have all three hashes, as follows:
SELECT ci_page FROM categoryintersections WHERE ci_hash = 'hashAB' AND ci_page IN ( SELECT ci_page FROM categoryintersections WHERE ci_hash = 'hashBC' AND ci_page IN ( SELECT ci_page FROM categoryintersections WHERE ci_hash = 'hashAC' ) )
As I've commented before: this query won't work on MySQL 4, so it can't be in core (unless perhaps disabled by default, or intelligently auto-enabled depending on SQL engine). It will also probably run very poorly on higher versions of MySQL, since MySQL is stupid about rewriting subqueries into joins. This should be written as a join:
SELECT ci_page FROM categoryintersection AS ci1 JOIN categoryintersection AS ci2 ON ci1.ci_page = ci2.ci_page WHERE ci1.ci_hash = 'hashAB' AND ci2.ci_hash = 'hashBC'
Didn't think of that.
Note that you don't need the third table at all; if something is in A intersect B and in B intersect C, it's automatically in A intersect C as well.
We probably need to hash generation functions then: one that generates all hashes corresponding to a certain page (AB, AC and BC), and one which generates hashes for a query (AB and BC only in this case).
In this case it's an extremely fast query, but that's because there are only one or two rows returned for each result. In the worst case of an empty match, or a match with fewer results than the LIMIT, it will have to scan through both intersections in their entirety, which may well be thousands of rows. It's also much less powerful than a Boolean full-text search: it can't handle subtractions and will probably have to be crippled to some small number of intersections.
Also, I'd like to suggest we merge this extension into core (after improving it first), thoughts?
Well, if someone writes an interface for category intersections, it might be reasonable to have multiple backends in core, given that the backend will be flexible anyway. One advantage of Magnus' scheme is that it will work in pretty much any SQL engine with no modification (at least if rewritten to eliminate advanced features like subqueries ;) ). The alternative suggestion of a fulltext engine will work only on MySQL 5, and possibly PostgreSQL (at least with appropriate extra coding).
I missed the explanation of the fulltext implementation. Something like 'Foo With_spaces Bar' and then do a fulltext search for the cats you need? That would be more powerful, and would probably be faster for complex intersections. I'll write an alternative to CategoryIntersections that uses the fulltext schema and run some benchmarks. I expect to have some results by the end of the week.
Roan Kattouw (Catrope)