Re: [Wikitech-l] The never-dying topic: category intersection

2 Dec 2008


      On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske
magnusmanske@googlemail.com wrote:
...
(feel free to bash me if we had this variant already, I couldn't find
it in the list archives)
Task: On German Wikipedia (yay atomic categories!), find women who
were born in 1901 and died in 1986.
Runtime : Toolserver, <2 sec
Query:
SELECT * FROM ( SELECT page_title,count(cl_to) AS cnt FROM
page,categorylinks WHERE page_id=cl_from AND cl_to in ( "Frau" ,
"Geboren_1901" , "Gestorben_1986" ) GROUP BY cl_from ) AS tbl1 WHERE
tbl1.cnt = 3 ;
This will fail with a syntax error on the main servers, because
subqueries aren't supported in MySQL 4.0.  You don't really need the
subquery, though; you should be able to just use HAVING:
SELECT page_title FROM page, categorylinks WHERE page_id=cl_from AND
cl_to in ( 'Frau', 'Geboren_1901' , 'Gestorben_1986' ) GROUP BY
cl_from HAVING COUNT(cl_to) = 3;
Your solution requires filesorting the union of the categories, as far
as I can tell.  I would expect it, offhand, to be significantly slower
than a solution using joins:
SELECT page_title FROM page JOIN categorylinks AS cl1 ON
page_id=cl1.cl_from JOIN categorylinks AS cl2 ON page_id=cl2.cl_from
JOIN categorylinks AS cl3 ON page_id=cl3.cl_from WHERE
cl1.cl_to='Frau' AND cl2.cl_to='Geboren_1901' AND cl3.cl_to =
'Gestorben_1986';
But I haven't benchmarked it, and who knows what kind of execution
quirks are happening here.
...
Trying to "poison" the query by also looking in all GFDL images
("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec.,
so not that bad.
3 seconds is a very long time for a query to run.  Typical queries
should take more like, say, 10 ms.  Occasional selects taking three
seconds might or might not kill the servers, but they're far from
optimal.  Also, did you try in a really worst-case scenario, like
intersecting "Unprintworthy redirects" with "Stub-Class biography
articles" on enwiki?  Obviously users aren't likely to legitimately
run an intersection of those exact categories (since they're logically
disjoint), but you should test this kind of thing to ensure
scalability.  The query appears to take 16s on your tool.
Again, the only really scalable solution looks to be fulltext search
of some kind.  We've known for a long time that category intersections
can easily be done well enough, for a modest standard of "well
enough", but that hasn't been considered good enough to run on
Wikipedia.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] The never-dying topic: category intersection