On Tue, Dec 2, 2008 at 2:57 PM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
(feel free to bash me if we had this variant
already, I couldn't find
it in the list archives)
Task: On German Wikipedia (yay atomic categories!), find women who
were born in 1901 and died in 1986.
Runtime: Toolserver, <2 sec
Query:
SELECT * FROM (
  SELECT page_title, COUNT(cl_to) AS cnt
  FROM page, categorylinks
  WHERE page_id = cl_from
    AND cl_to IN ("Frau", "Geboren_1901", "Gestorben_1986")
  GROUP BY cl_from
) AS tbl1 WHERE tbl1.cnt = 3;
This will fail with a syntax error on the main servers, because
subqueries aren't supported in MySQL 4.0. You don't really need the
subquery, though; you should be able to just use HAVING:
SELECT page_title
FROM page, categorylinks
WHERE page_id = cl_from
  AND cl_to IN ('Frau', 'Geboren_1901', 'Gestorben_1986')
GROUP BY cl_from
HAVING COUNT(cl_to) = 3;
You're right. Fixed in the tool.
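For anyone who wants to poke at the HAVING variant locally, here is a minimal sketch using SQLite and invented toy data (article names and IDs are made up; the schema is cut down to just the columns the query touches):

```python
import sqlite3

# Toy stand-ins for MediaWiki's page and categorylinks tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 'Artikel_A'), (2, 'Artikel_B');
INSERT INTO categorylinks VALUES
  (1, 'Frau'), (1, 'Geboren_1901'), (1, 'Gestorben_1986'),
  (2, 'Frau');
""")

# The HAVING variant: count category hits per page, keep pages
# that matched all three categories.
rows = con.execute("""
    SELECT page_title FROM page, categorylinks
    WHERE page_id = cl_from
      AND cl_to IN ('Frau', 'Geboren_1901', 'Gestorben_1986')
    GROUP BY cl_from
    HAVING COUNT(cl_to) = 3
""").fetchall()
print(rows)  # [('Artikel_A',)]
```

Only Artikel_A is in all three categories, so it is the single row returned.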
Your solution requires filesorting the union of the
categories, as far
as I can tell. I would expect it, offhand, to be significantly slower
than a solution using joins:
SELECT page_title
FROM page
JOIN categorylinks AS cl1 ON page_id = cl1.cl_from
JOIN categorylinks AS cl2 ON page_id = cl2.cl_from
JOIN categorylinks AS cl3 ON page_id = cl3.cl_from
WHERE cl1.cl_to = 'Frau'
  AND cl2.cl_to = 'Geboren_1901'
  AND cl3.cl_to = 'Gestorben_1986';
But I haven't benchmarked it, and who knows what kind of execution
quirks are happening here.
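The self-join variant can be exercised on the same kind of toy data (again invented rows, and SQLite standing in for MySQL, so the actual execution plans on the real servers may differ):

```python
import sqlite3

# One categorylinks alias per required category; a page survives
# only if all three joins find a matching row for it.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 'Artikel_A'), (2, 'Artikel_B');
INSERT INTO categorylinks VALUES
  (1, 'Frau'), (1, 'Geboren_1901'), (1, 'Gestorben_1986'),
  (2, 'Frau'), (2, 'Geboren_1901');
""")

rows = con.execute("""
    SELECT page_title
    FROM page
    JOIN categorylinks AS cl1 ON page_id = cl1.cl_from
    JOIN categorylinks AS cl2 ON page_id = cl2.cl_from
    JOIN categorylinks AS cl3 ON page_id = cl3.cl_from
    WHERE cl1.cl_to = 'Frau'
      AND cl2.cl_to = 'Geboren_1901'
      AND cl3.cl_to = 'Gestorben_1986'
""").fetchall()
print(rows)  # [('Artikel_A',)]
```

Artikel_B is missing "Gestorben_1986", so the third join filters it out.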
It seems the JOIN query is significantly faster when all categories are large.
However, with one or more small categories, I can do a pre-selection
of pages (get page_ids for the intersection of the small categories,
then look only for these in the larger ones), which in turn is
significantly faster than the JOIN.
My tool now uses the algorithm appropriate for the respective query.
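The pre-selection idea can be sketched in Python (a hypothetical helper, not the tool's actual code): intersect the smallest categories first, so that the large categories are only probed for a short candidate list.

```python
def intersect_categories(cat_members: dict, wanted: list) -> set:
    """cat_members maps category name -> set of page_ids, a toy
    stand-in for SELECT cl_from FROM categorylinks WHERE cl_to = ?."""
    # Start with the smallest categories so the candidate set shrinks fast.
    cats = sorted(wanted, key=lambda c: len(cat_members[c]))
    candidates = set(cat_members[cats[0]])
    for cat in cats[1:]:
        # In SQL this step would be WHERE cl_to = ? AND cl_from IN (...),
        # touching only the (few) candidates inside the big category.
        candidates &= cat_members[cat]
        if not candidates:
            break  # empty intersection: no need to scan further
    return candidates

# Invented membership data for illustration.
members = {
    "Frau": {1, 2, 3},
    "Geboren_1901": {1, 2},
    "Gestorben_1986": {1, 4},
}
print(intersect_categories(members, ["Frau", "Geboren_1901", "Gestorben_1986"]))
# {1}
```

The early exit on an empty intersection is what makes disjoint-category worst cases cheap in this approach.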
Trying to
"poison" the query by also looking in all GFDL images
("GFDL-Bild", ~60K entries in the category) increases runtime to 3
sec., so not that bad.
3 seconds is a very long time for a query to run. Typical queries
should take more like, say, 10 ms. Occasional selects taking three
seconds might or might not kill the servers, but they're far from
optimal.
I am uncertain how much the toolserver factors in here. The poor thing
is under a lot of stress ;-)
Also, did you try in a really worst-case scenario,
like
intersecting "Unprintworthy redirects" with "Stub-Class biography
articles" on enwiki? Obviously users aren't likely to legitimately
run an intersection of those exact categories (since they're logically
disjoint), but you should test this kind of thing to ensure
scalability. The query appears to take 16s on your tool.
I ran it again now, and it falls back to the JOIN solution, taking ~10
sec. As a worst-case scenario, I call that acceptable for the tool.
It might not be acceptable for Wikipedia ATM. We could experiment how
this performs on the "real" servers, though.
Also, we could restrict certain queries. We know the category size,
and in my approach, we know how many articles are in the "small
category" intersection. From there, we could guesstimate the
worst-case time, and kill the query, or run it in MySQL slow mode
(forgot the correct name) to not stress the servers too much.
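Such a guard might look like this hypothetical sketch (the threshold and the cost model are invented purely for illustration, not measured):

```python
# Refuse, or demote to a low-priority run, any intersection whose
# rough worst-case row estimate exceeds a fixed budget.
MAX_ESTIMATED_ROWS = 500_000  # invented threshold

def allow_query(category_sizes: list) -> bool:
    """Rough guess: the JOIN fallback probes on the order of the
    smallest category's size, once per category involved."""
    estimated = min(category_sizes) * len(category_sizes)
    return estimated <= MAX_ESTIMATED_ROWS

print(allow_query([60_000, 45_000]))    # True: cheap enough
print(allow_query([900_000, 800_000]))  # False: kill or demote
```

In practice one would calibrate the model against real query timings rather than category sizes alone.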
Again, the only really scalable solution looks to be
fulltext search
of some kind. We've known for a long time that category intersections
can easily be done well enough, for a modest standard of "well
enough", but that hasn't been considered good enough to run on
Wikipedia.
No matter the method, I think the problem should get high priority. I
currently see a case on Commons, where there's now "Category:Paintings
by Vincent van Gogh in this-and-that-museum". It's getting ridiculous
(or is already there).
Magnus
P.S.: Just got a message on my Commons talk page about the
"+incategory:" search function.
* This is based on the Lucene index, right? How often is that updated?
* Is there a decent interface/special page for that? It's a pain to
enter this manually, and I doubt many people know about it.
* Is there a machine-readable interface for this? One that will return
5K hits without screenscraping?