On Tue, Dec 2, 2008 at 2:57 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske <magnusmanske@googlemail.com> wrote:
(feel free to bash me if we had this variant already, I couldn't find it in the list archives)
Task: On German Wikipedia (yay atomic categories!), find women who were born in 1901 and died in 1986.
Runtime: Toolserver, <2 sec
Query:
SELECT * FROM (
  SELECT page_title, COUNT(cl_to) AS cnt
  FROM page, categorylinks
  WHERE page_id=cl_from
    AND cl_to IN ("Frau", "Geboren_1901", "Gestorben_1986")
  GROUP BY cl_from
) AS tbl1
WHERE tbl1.cnt = 3;
This will fail with a syntax error on the main servers, because subqueries aren't supported in MySQL 4.0. You don't really need the subquery, though; you should be able to just use HAVING:
SELECT page_title
FROM page, categorylinks
WHERE page_id=cl_from
  AND cl_to IN ('Frau', 'Geboren_1901', 'Gestorben_1986')
GROUP BY cl_from
HAVING COUNT(cl_to) = 3;
You're right. Fixed in the tool.
Your solution requires filesorting the union of the categories, as far as I can tell. I would expect it, offhand, to be significantly slower than a solution using joins:
SELECT page_title
FROM page
JOIN categorylinks AS cl1 ON page_id=cl1.cl_from
JOIN categorylinks AS cl2 ON page_id=cl2.cl_from
JOIN categorylinks AS cl3 ON page_id=cl3.cl_from
WHERE cl1.cl_to='Frau'
  AND cl2.cl_to='Geboren_1901'
  AND cl3.cl_to='Gestorben_1986';
But I haven't benchmarked it, and who knows what kind of execution quirks are happening here.
It seems the JOIN query is significantly faster when all categories are large.
However, with one or more small categories, I can do a pre-selection of pages (get page_ids for the intersection of the small categories, then look only for these in the larger ones), which in turn is significantly faster than the JOIN. My tool now uses the algorithm appropriate for the respective query.
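Roughly, the pre-selection works like this (a sketch only; assume for the example that "Geboren_1901" is the small category, and note that the tool inlines the id list itself, since MySQL 4.0 has no subqueries):

-- Step 1: fetch the page_ids of the small category (small result set)
SELECT cl_from FROM categorylinks WHERE cl_to='Geboren_1901';

-- Step 2: check only those ids against the remaining categories
SELECT page_title
FROM page, categorylinks
WHERE page_id=cl_from
  AND cl_to IN ('Frau', 'Gestorben_1986')
  AND page_id IN ( ...ids from step 1, inlined by the tool... )
GROUP BY cl_from
HAVING COUNT(cl_to) = 2;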
Trying to "poison" the query by also looking in all GFDL images ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec., so not that bad.
3 seconds is a very long time for a query to run. Typical queries should take more like, say, 10 ms. Occasional selects taking three seconds might or might not kill the servers, but they're far from optimal.
I am uncertain how much the toolserver factors in here. The poor thing is under a lot of stress ;-)
Also, did you try in a really worst-case scenario, like intersecting "Unprintworthy redirects" with "Stub-Class biography articles" on enwiki? Obviously users aren't likely to legitimately run an intersection of those exact categories (since they're logically disjoint), but you should test this kind of thing to ensure scalability. The query appears to take 16s on your tool.
I ran it again now, and it falls back to the JOIN solution, taking ~10 sec. As a worst-case scenario, I call that acceptable for the tool.
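For reference, the JOIN fallback for that pairing would look roughly like this (category names given in the underscore form that cl_to stores):

SELECT page_title
FROM page
JOIN categorylinks AS cl1 ON page_id=cl1.cl_from
JOIN categorylinks AS cl2 ON page_id=cl2.cl_from
WHERE cl1.cl_to='Unprintworthy_redirects'
  AND cl2.cl_to='Stub-Class_biography_articles';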
It might not be acceptable for Wikipedia ATM. We could experiment with how this performs on the "real" servers, though.
Also, we could restrict certain queries. We know the category sizes, and in my approach, we know how many articles are in the "small category" intersection. From there, we could guesstimate the worst-case time and kill the query, or run it in MySQL slow mode (forgot the correct name) so as not to stress the servers too much.
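A rough sketch of such an up-front size check (just counting categorylinks rows; the actual threshold would be guesswork on our part):

SELECT cl_to, COUNT(*) AS cat_size
FROM categorylinks
WHERE cl_to IN ('Frau', 'Geboren_1901', 'Gestorben_1986')
GROUP BY cl_to;

-- If even the smallest cat_size is above the threshold, refuse the query,
-- or fall back to the cheaper/low-priority execution path.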
Again, the only really scalable solution looks to be fulltext search of some kind. We've known for a long time that category intersections can easily be done well enough, for a modest standard of "well enough", but that hasn't been considered good enough to run on Wikipedia.
No matter which method we pick, I think the problem should get high priority. There is currently a case on Commons where we now have "Category:Paintings by Vincent van Gogh in this-and-that-museum". It's getting ridiculous (or is already there).
Magnus
P.S.: Just got a message on my Commons talk page about the "+incategory:" search function.
* This is based on the Lucene index, right? How often is that updated?
* Is there a decent interface/special page for that? It's a pain to enter this manually, and I doubt many people know about it.
* Is there a machine-readable interface for this? One that will return 5K hits without screenscraping?