Timwi wrote
You method would:
- retrieve all pages from category1
- retrieve all pages from category2
- sort the whole list
- group them (i.e. replace runs with their counts)
- filter out those where the count is 2
The straightfoward method would be:
select a.page from categorylinks a where a.category='category1' AND EXISTS ( select * from categorylinks b where a.page=b.page AND b.category='category2' )
This would:
- retrieve all the pages from category2
- find all the pages in category1 that are any of those
I don't have MySQL handy, but tested this on MSSQL and the second method was way faster. :)
Timwi
Just out of curiousity, I tried it in MySQL with real data (25,000 rows of categorylinks - not wikipedia, but something). I cycle through each type of query 100 times, and did no other processing besides running the loop and executing the query. Each way took about 20 seconds for the 100 loops. I also ran a control query ("SELECT * FROM categorylinks LIMIT 1") and it took lik .05 seconds, and then .014 or something - it must be caching the query. I didn't try it with the joins yet. So it looks like MySQL handles it differently than MSSQL. I wonder if there's a better way than both of these (maybe using the joins? I'll have to try it.).
Best Regards, Aerik
Just out of curiousity, I tried it in MySQL with real data (25,000 rows of categorylinks - not wikipedia, but something). I cycle through each type of query 100 times, and did no other processing besides running the loop and executing the query. Each way took about 20 seconds for the 100 loops. I also ran a control query ("SELECT * FROM categorylinks LIMIT 1") and it took lik .05 seconds, and then .014 or something - it must be caching the query. I didn't try it with the joins yet. So it looks like MySQL handles it differently than MSSQL. I wonder if there's a better way than both of these (maybe using the joins? I'll have to try it.).
As you already mentioned, it might be caching the query. If you run it 100 times, maybe it takes some time to run the query once, and then a constant time to look up the result in the cache 99 times.
wikitech-l@lists.wikimedia.org