Re: [Wikitech-l] The never-dying topic: category intersection

2 Dec 2008


      On Tue, Dec 2, 2008 at 11:40 AM, Magnus Manske
magnusmanske@googlemail.com wrote:
...
I am uncertain how much the toolserver factors in here. The poor thing
is under a lot of stress ;-)
The query has to scan all of the categorylinks rows for all of the
categories you specify, at least in the worst case.  That could be a
few hundred thousand rows, maybe a million or more if you combine
several very large categories.  That will take a few seconds even on
the real servers, probably (from experience with SELECT COUNT(*) FROM
categorylinks WHERE cl_to='Foo' in Special:Category).
...
I ran it again now, and it falls back to the JOIN solution, taking ~10
sec. As a worst-case scenario, I call that acceptable for the tool.
It might not be acceptable for Wikipedia ATM. We could experiment how
this performs on the "real" servers, though.
It might be acceptable if it's not run too often, it just wouldn't be
ideal.  We're not talking about running such queries on every page
view, I assume, so it shouldn't be the end of the world.  It would be
good to get a more efficient way, but the important thing is for
someone to actually get something in the core software period, IMO.
We have any number of toolserver tools to do this, probably at least
five, but that's not going to get us progress.
...
Also, we could restrict certain queries. We know the category size,
and in my approach, we know how many articles are in the "small
category" intersection. Form there, we could guesstimate the
worst-case time, and kill the query, or run it in MySQL slow mode
(forgot the correct name) to not stress the servers too much.
Read uncommitted?
...
No matter what method, I think the problem should get high priority. I
currently see a case on Commons, where there's now "Category:Paintings
by Vincent van Gogh in this-and-that-museum". It's getting ridiculous
(or is already there).
Lots of things should get high priority and don't.  Look at how
sorting on category pages is completely broken for a lot of languages,
for instance, due to sorting in code point order.  Someone with commit
access has to spend the time fix it, is all.
...
P.S.: Just got a message on my Commons talk page about the
"+incategory:" search function.
That doesn't include transcluded categories, so it's not a proper
solution.  I don't know much about it.  Lucene would likely be a good
choice for a "real" solution, but we'd need to make up a separate
table for it, not just use the page text table.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] The never-dying topic: category intersection