On Tue, Dec 2, 2008 at 11:40 AM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
I am uncertain how much the toolserver factors in
here. The poor thing
is under a lot of stress ;-)
The query has to scan all of the categorylinks rows for all of the
categories you specify, at least in the worst case. That could be a
few hundred thousand rows, maybe a million or more if you combine
several very large categories. That will take a few seconds even on
the real servers, probably (from experience with SELECT COUNT(*) FROM
categorylinks WHERE cl_to='Foo' in Special:Category).
I ran it again now, and it falls back to the JOIN
solution, taking ~10
sec. As a worst-case scenario, I call that acceptable for the tool.
It might not be acceptable for Wikipedia ATM. We could experiment how
this performs on the "real" servers, though.
It might be acceptable if it's not run too often, it just wouldn't be
ideal. We're not talking about running such queries on every page
view, I assume, so it shouldn't be the end of the world. It would be
good to get a more efficient way, but the important thing is for
someone to actually get something in the core software period, IMO.
We have any number of toolserver tools to do this, probably at least
five, but that's not going to get us progress.
Also, we could restrict certain queries. We know the
category size,
and in my approach, we know how many articles are in the "small
category" intersection. Form there, we could guesstimate the
worst-case time, and kill the query, or run it in MySQL slow mode
(forgot the correct name) to not stress the servers too much.
Read uncommitted?
No matter what method, I think the problem should get
high priority. I
currently see a case on Commons, where there's now "Category:Paintings
by Vincent van Gogh in this-and-that-museum". It's getting ridiculous
(or is already there).
Lots of things should get high priority and don't. Look at how
sorting on category pages is completely broken for a lot of languages,
for instance, due to sorting in code point order. Someone with commit
access has to spend the time fix it, is all.
P.S.: Just got a message on my Commons talk page about
the
"+incategory:" search function.
That doesn't include transcluded categories, so it's not a proper
solution. I don't know much about it. Lucene would likely be a good
choice for a "real" solution, but we'd need to make up a separate
table for it, not just use the page text table.