On Tue, Dec 2, 2008 at 11:40 AM, Magnus Manske magnusmanske@googlemail.com wrote:
I am uncertain how much the toolserver factors in here. The poor thing is under a lot of stress ;-)
The query has to scan all of the categorylinks rows for all of the categories you specify, at least in the worst case. That could be a few hundred thousand rows, maybe a million or more if you combine several very large categories. That will take a few seconds even on the real servers, probably (from experience with SELECT COUNT(*) FROM categorylinks WHERE cl_to='Foo' in Special:Category).
I ran it again now, and it falls back to the JOIN solution, taking ~10 sec. As a worst-case scenario, I call that acceptable for the tool.
It might not be acceptable for Wikipedia ATM. We could experiment how this performs on the "real" servers, though.
It might be acceptable if it's not run too often, it just wouldn't be ideal. We're not talking about running such queries on every page view, I assume, so it shouldn't be the end of the world. It would be good to get a more efficient way, but the important thing is for someone to actually get something in the core software period, IMO. We have any number of toolserver tools to do this, probably at least five, but that's not going to get us progress.
Also, we could restrict certain queries. We know the category size, and in my approach, we know how many articles are in the "small category" intersection. Form there, we could guesstimate the worst-case time, and kill the query, or run it in MySQL slow mode (forgot the correct name) to not stress the servers too much.
Read uncommitted?
No matter what method, I think the problem should get high priority. I currently see a case on Commons, where there's now "Category:Paintings by Vincent van Gogh in this-and-that-museum". It's getting ridiculous (or is already there).
Lots of things should get high priority and don't. Look at how sorting on category pages is completely broken for a lot of languages, for instance, due to sorting in code point order. Someone with commit access has to spend the time fix it, is all.
P.S.: Just got a message on my Commons talk page about the "+incategory:" search function.
That doesn't include transcluded categories, so it's not a proper solution. I don't know much about it. Lucene would likely be a good choice for a "real" solution, but we'd need to make up a separate table for it, not just use the page text table.