Sorry - here are the queries I was playing with - I had intended to send a rather more detailed and thought out analysis, but haven't had time in the last couple of days to play with it much more. These times are from running the queries on phpMyAdmin on my home computer that wasn't doing anything else strenuous at the time. As you can see, I used count... group by instead of join. They could be the same, inside MySQL, but it doesn't seem like it. I tried doing some large joins last night, and they never returned any results, but that could be an instability in my computer - I need to try it again.
Just to get this out of the way: These number are of course, only relevant to themselves, but I was thinking of comparing the times to return intersections against the time it takes to do something that is known on Wikipedia (say, get the first 200 results of the "Living_People" category?) and try to extrapolate a meaningful estimate.
Showing rows 0 - 15 (16 total, Query took 0.3316 sec) SQL query: SELECT cl_sortkey, count( * ) AS catcount FROM `categorylinks` WHERE `cl_to` = 'Fantasy_films' OR `cl_to` = 'Disney_films ' GROUP BY cl_sortkey HAVING catcount =2 LIMIT 0 , 30
Showing rows 0 - 9 (10 total, Query took 0.4407 sec) SQL query: SELECT cl_sortkey, count( * ) AS catcount FROM `categorylinks` WHERE `cl_to` = 'United_States_Army_soldiers' OR `cl_to` = 'German-Americans ' GROUP BY cl_sortkey HAVING catcount =2 LIMIT 0 , 30
Showing rows 0 - 19 (20 total, Query took 0.8908 sec) SQL query: SELECT cl_sortkey, count( * ) AS catcount FROM `categorylinks` WHERE `cl_to` = 'Drama_films' OR `cl_to` = 'World_War_II_films ' GROUP BY cl_sortkey HAVING catcount =2 LIMIT 0 , 30
Showing rows 0 - 29 (2,169 total, Query took 36.3109 sec) SQL query: SELECT cl_sortkey, count( * ) AS catcount FROM `categorylinks` WHERE `cl_to` = 'Prisoners_of_war' OR `cl_to` = 'Living_people' GROUP BY cl_sortkey HAVING catcount =2 LIMIT 0 , 30
Showing rows 0 - 1 (2 total, Query took 8.6470 sec) SQL query: SELECT cl_sortkey, count( * ) AS catcount FROM `categorylinks` WHERE `cl_to` = 'Articles_with_unsourced_statements' OR `cl_to` = 'American_World_War_II_veterans ' GROUP BY cl_sortkey HAVING catcount =2 LIMIT 0 , 30
Aerik Sylvan wrote:
Showing rows 0 - 15 (16 total, Query took 0.3316 sec) SQL query: SELECT cl_sortkey, count( * ) AS catcount FROM `categorylinks` WHERE `cl_to` = 'Fantasy_films' OR `cl_to` = 'Disney_films ' GROUP BY cl_sortkey HAVING catcount =2 LIMIT 0 , 30
I thought I had mentioned before (maybe not to you) that this query is unnecessarily inefficient (because it scans both categories completely). Why do you use this adventurous approach when there is a much simpler and more obvious one, which also happens to be faster?
SELECT c.cl_from FROM categorylinks c WHERE c.cl_from IN ( SELECT c2.cl_from FROM categorylinks c2 WHERE c2.cl_to='Fantasy_films' ) AND c.cl_to='Disney_films'
wikitech-l@lists.wikimedia.org