On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske magnusmanske@googlemail.com wrote:
(feel free to bash me if we had this variant already, I couldn't find it in the list archives)
Task: On German Wikipedia (yay atomic categories!), find women who were born in 1901 and died in 1986. Runtime : Toolserver, <2 sec Query: SELECT * FROM ( SELECT page_title,count(cl_to) AS cnt FROM page,categorylinks WHERE page_id=cl_from AND cl_to in ( "Frau" , "Geboren_1901" , "Gestorben_1986" ) GROUP BY cl_from ) AS tbl1 WHERE tbl1.cnt = 3 ;
This will fail with a syntax error on the main servers, because subqueries aren't supported in MySQL 4.0. You don't really need the subquery, though; you should be able to just use HAVING:
SELECT page_title FROM page, categorylinks WHERE page_id=cl_from AND cl_to in ( 'Frau', 'Geboren_1901' , 'Gestorben_1986' ) GROUP BY cl_from HAVING COUNT(cl_to) = 3;
Your solution requires filesorting the union of the categories, as far as I can tell. I would expect it, offhand, to be significantly slower than a solution using joins:
SELECT page_title FROM page JOIN categorylinks AS cl1 ON page_id=cl1.cl_from JOIN categorylinks AS cl2 ON page_id=cl2.cl_from JOIN categorylinks AS cl3 ON page_id=cl3.cl_from WHERE cl1.cl_to='Frau' AND cl2.cl_to='Geboren_1901' AND cl3.cl_to = 'Gestorben_1986';
But I haven't benchmarked it, and who knows what kind of execution quirks are happening here.
Trying to "poison" the query by also looking in all GFDL images ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec., so not that bad.
3 seconds is a very long time for a query to run. Typical queries should take more like, say, 10 ms. Occasional selects taking three seconds might or might not kill the servers, but they're far from optimal. Also, did you try in a really worst-case scenario, like intersecting "Unprintworthy redirects" with "Stub-Class biography articles" on enwiki? Obviously users aren't likely to legitimately run an intersection of those exact categories (since they're logically disjoint), but you should test this kind of thing to ensure scalability. The query appears to take 16s on your tool.
Again, the only really scalable solution looks to be fulltext search of some kind. We've known for a long time that category intersections can easily be done well enough, for a modest standard of "well enough", but that hasn't been considered good enough to run on Wikipedia.