Rob Church wrote:
The classic intersections use UNION queries which are too expensive; this is part of the reason we don't have a straightforward intersecting search, and hence are having to investigate other means.
It's not a union actually, it's a straight join. An OR operation is a union, both in set theory and in SQL.
To get an idea of the potential performance problems, I made a list of the categories with the most members on commons.
563713 GFDL 405084 Self-published_work 218472 PD-self 205838 PD_Old 139486 CC-BY-SA-2.5,2.0,1.0 125657 CC-BY-SA-2.5 117172 CC-BY-2.5 38695 PD-user 29312 CC-BY-2.0 22821 Insignia 21902 CC-BY-SA-2.0 20906 User-created_GFDL_images
How long does it take to calculate the intersection between the two biggest categories? On lomaria:
mysql> select count(*) from categorylinks as c1, categorylinks as c2 where c1.cl_to='GFDL' and c2.cl_to='Self-published_work' and c1.cl_from=c2.cl_from; +----------+ | count(*) | +----------+ | 317870 | +----------+ 1 row in set (1.94 sec)
And again immediately afterwards, with a comment to override the query cache, but testing the index cache:
mysql> select /**/ count(*) from categorylinks as c1, categorylinks as c2 where c1.cl_to='GFDL' and c2.cl_to='Self-published_work' and c1.cl_from=c2.cl_from; +----------+ | count(*) | +----------+ | 317870 | +----------+ 1 row in set (1.91 sec)
Which is really not that bad, considering it is the worst case. The index size for commonswiki.categorylinks is 670MB, which our DB servers shouldn't have too much trouble with at present. Indeed, the query would be a lot slower if it were hitting the disk. Maybe it is practical, as long as we put it in a query group so that it doesn't crash the whole s2 cluster if someone decides to put such an intersection on the commons main page.
Intersecting GFDL with various other categories down the list, here is the timing after priming the cache:
PD-self: 1.17s CC-BY-SA-2.5,2.0,1.0: 0.75s PD-user: 0.22s Insignia: 0.15s User-created_GFDL_images: 0.10s
The figures show that the timing depends on the size of each category not on the size of the intersection, which is what you'd expect. The intersection for PD-self had 1442 rows, and the intersection for "CC-BY-SA-2.5,2.0,1.0" had 135107 rows, but the timing was similar.
I use cache-primed figures (i.e. the second attempt), because this better reflects the worst case scenario of high request rates. I'm assuming that the whole index would be held in memory under such circumstances.
A special page would be better, it would reduce the number of unnecessary requests. And there may be other performance problems with DPL that I'm not aware of. But on the basis of these results, it looks feasible.
-- Tim Starling