The never-dying topic: category intersection


Magnus Manske

2 Dec 2008

1:01 p.m.

(feel free to bash me if we had this variant already, I couldn't find it in the list archives)

Task: On German Wikipedia (yay atomic categories!), find women who were born in 1901 and died in 1986.
Runtime: Toolserver, <2 sec
Query:

    SELECT * FROM (
        SELECT page_title, count(cl_to) AS cnt
        FROM page, categorylinks
        WHERE page_id = cl_from
          AND cl_to IN ("Frau", "Geboren_1901", "Gestorben_1986")
        GROUP BY cl_from
    ) AS tbl1
    WHERE tbl1.cnt = 3;

Trying to "poison" the query by also looking in all GFDL images ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec., so not that bad.

I've implemented this as a tool now: http://toolserver.org/~magnus/category_intersection.php

Queries seem to take a little longer there (2-4 sec) compared to the command line.

Articles on en.wikipedia with "1905 births" and "1967 deaths" took <0.4 sec. OTOH, looking for images on Commons in "GFDL" and "Buildings in Berlin" took ~2min. Might be the giant GFDL category, or the toolserver, or both. I'll try to fiddle with it some more utilising cat_pages/cat_files.

Magnus

Magnus Manske

2 Dec

2:11 p.m.

...

OTOH, looking for images on Commons in "GFDL" and "Buildings in Berlin" took ~2min. Might be the giant GFDL category, or the toolserver, or both. I'll try to fiddle with it some more utilising cat_pages/cat_files.

Hah! By using small categories first, then restricting possible page_ids in the query for the larger categories, I got it down to 3 sec!

Testing "Buildings in Berlin" and "PD Old" (to avoid false timings from cache) : < 0.6 sec.

This way, adding more intersections with small categories (where currently "small" is < 20,000 pages) will actually make the query run faster.
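A minimal sketch of the small-categories-first idea described above, written in Python against an in-memory SQLite stand-in. The table and column names (`page`, `categorylinks`, `cl_from`, `cl_to`) follow the queries quoted in this thread; the sample rows and the function names are made up for illustration, and this is not the tool's actual code.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory stand-in for the page and categorylinks tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        CREATE INDEX cl_to_from ON categorylinks (cl_to, cl_from);
    """)
    # Made-up sample rows: page_id -> (title, categories)
    sample = {
        1: ("File_A", ["GFDL", "Buildings_in_Berlin"]),
        2: ("File_B", ["Buildings_in_Berlin"]),
        3: ("File_C", ["GFDL"]),
        4: ("File_D", ["GFDL"]),
    }
    for page_id, (title, cats) in sample.items():
        db.execute("INSERT INTO page VALUES (?, ?)", (page_id, title))
        db.executemany("INSERT INTO categorylinks VALUES (?, ?)",
                       [(page_id, cat) for cat in cats])
    return db


def category_size(db, cat):
    # A real wiki could read a stored counter instead; counting keeps the sketch small.
    row = db.execute("SELECT COUNT(*) FROM categorylinks WHERE cl_to = ?", (cat,))
    return row.fetchone()[0]


def intersect_small_first(db, cats):
    """Intersect categories, starting from the smallest and narrowing page ids."""
    ordered = sorted(cats, key=lambda cat: category_size(db, cat))
    # Seed the candidate set from the smallest category.
    candidates = {r[0] for r in db.execute(
        "SELECT cl_from FROM categorylinks WHERE cl_to = ?", (ordered[0],))}
    for cat in ordered[1:]:
        if not candidates:
            break
        # Only look at rows of the larger category that match known candidates.
        # (SQLite caps the number of "?" parameters, so this suits small sets.)
        marks = ",".join("?" * len(candidates))
        candidates = {r[0] for r in db.execute(
            f"SELECT cl_from FROM categorylinks "
            f"WHERE cl_to = ? AND cl_from IN ({marks})", (cat, *candidates))}
    if not candidates:
        return []
    marks = ",".join("?" * len(candidates))
    rows = db.execute(
        f"SELECT page_title FROM page WHERE page_id IN ({marks}) "
        f"ORDER BY page_title", tuple(candidates))
    return [r[0] for r in rows]
```

With this sample data, `intersect_small_first(make_db(), ["GFDL", "Buildings_in_Berlin"])` starts from the two-page category and only then touches the larger one. Each extra small category can only shrink the candidate set, which is presumably why more intersections make the query faster rather than slower.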

I think I'm onto something here. Then again, I thought that before :-)

Magnus

Mohamed Magdy

2:37 p.m.

On Tue, Dec 2, 2008 at 2:01 PM, Magnus Manske magnusmanske@googlemail.com wrote:

...

(feel free to bash me if we had this variant already, I couldn't find it in the list archives)

Task: On German Wikipedia (yay atomic categories!), find women who were born in 1901 and died in 1986.
Runtime: Toolserver, <2 sec
Query:

    SELECT * FROM (
        SELECT page_title, count(cl_to) AS cnt
        FROM page, categorylinks
        WHERE page_id = cl_from
          AND cl_to IN ("Frau", "Geboren_1901", "Gestorben_1986")
        GROUP BY cl_from
    ) AS tbl1
    WHERE tbl1.cnt = 3;

Trying to "poison" the query by also looking in all GFDL images ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec., so not that bad.

I've implemented this as a tool now: http://toolserver.org/~magnus/category_intersection.php

Queries seem to take a little longer there (2-4 sec) compared to the command line.

Articles on en.wikipedia with "1905 births" and "1967 deaths" took <0.4 sec. OTOH, looking for images on Commons in "GFDL" and "Buildings in Berlin" took ~2min. Might be the giant GFDL category, or the toolserver, or both. I'll try to fiddle with it some more utilising cat_pages/cat_files.

Magnus

Very nice. Danke.

You should mention that categories are entered each at a separate line (or an example), as it took me some trials to figure it out.

-- --alnokta

Aryeh Gregor

3:57 p.m.

On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske magnusmanske@googlemail.com wrote:

...

(feel free to bash me if we had this variant already, I couldn't find it in the list archives)

Task: On German Wikipedia (yay atomic categories!), find women who were born in 1901 and died in 1986.
Runtime: Toolserver, <2 sec
Query:

    SELECT * FROM (
        SELECT page_title, count(cl_to) AS cnt
        FROM page, categorylinks
        WHERE page_id = cl_from
          AND cl_to IN ("Frau", "Geboren_1901", "Gestorben_1986")
        GROUP BY cl_from
    ) AS tbl1
    WHERE tbl1.cnt = 3;

This will fail with a syntax error on the main servers, because subqueries aren't supported in MySQL 4.0. You don't really need the subquery, though; you should be able to just use HAVING:

    SELECT page_title
    FROM page, categorylinks
    WHERE page_id = cl_from
      AND cl_to IN ('Frau', 'Geboren_1901', 'Gestorben_1986')
    GROUP BY cl_from
    HAVING COUNT(cl_to) = 3;

Your solution requires filesorting the union of the categories, as far as I can tell. I would expect it, offhand, to be significantly slower than a solution using joins:

    SELECT page_title
    FROM page
    JOIN categorylinks AS cl1 ON page_id = cl1.cl_from
    JOIN categorylinks AS cl2 ON page_id = cl2.cl_from
    JOIN categorylinks AS cl3 ON page_id = cl3.cl_from
    WHERE cl1.cl_to = 'Frau'
      AND cl2.cl_to = 'Geboren_1901'
      AND cl3.cl_to = 'Gestorben_1986';

But I haven't benchmarked it, and who knows what kind of execution quirks are happening here.
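For anyone who wants to check that the two formulations really return the same pages before benchmarking them, here is a small Python sketch that runs both against an in-memory SQLite database. The table and column names come from the queries above; the sample people (`Person_A` and so on), the helper names, and the use of SQLite instead of MySQL are assumptions for illustration, so it says nothing about relative speed on the real servers.

```python
import sqlite3


def make_db():
    """Tiny stand-in for page/categorylinks with made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        -- One row per (page, category) pair, so COUNT(cl_to) cannot double-count.
        CREATE TABLE categorylinks (
            cl_from INTEGER, cl_to TEXT, PRIMARY KEY (cl_from, cl_to));
    """)
    sample = {
        1: ("Person_A", ["Frau", "Geboren_1901", "Gestorben_1986"]),
        2: ("Person_B", ["Frau", "Geboren_1901"]),
        3: ("Person_C", ["Mann", "Geboren_1901", "Gestorben_1986"]),
        4: ("Person_D", ["Frau", "Geboren_1901", "Gestorben_1986", "Physiker"]),
    }
    for page_id, (title, cats) in sample.items():
        db.execute("INSERT INTO page VALUES (?, ?)", (page_id, title))
        db.executemany("INSERT INTO categorylinks VALUES (?, ?)",
                       [(page_id, cat) for cat in cats])
    return db


def intersect_having(db, cats):
    """GROUP BY / HAVING variant: keep pages that hit every listed category."""
    marks = ",".join("?" * len(cats))
    sql = f"""
        SELECT page_title
        FROM page JOIN categorylinks ON page_id = cl_from
        WHERE cl_to IN ({marks})
        GROUP BY page_id, page_title
        HAVING COUNT(cl_to) = ?
        ORDER BY page_title
    """
    return [r[0] for r in db.execute(sql, (*cats, len(cats)))]


def intersect_joins(db, cats):
    """Self-join variant: one categorylinks alias per category (cats non-empty)."""
    joins = " ".join(
        f"JOIN categorylinks AS cl{i} ON page_id = cl{i}.cl_from"
        for i in range(len(cats)))
    where = " AND ".join(f"cl{i}.cl_to = ?" for i in range(len(cats)))
    sql = f"SELECT page_title FROM page {joins} WHERE {where} ORDER BY page_title"
    return [r[0] for r in db.execute(sql, tuple(cats))]
```

Calling both functions with `["Frau", "Geboren_1901", "Gestorben_1986"]` on this sample returns `Person_A` and `Person_D` either way. Note that the HAVING variant relies on each (page, category) pair appearing at most once.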

...

Trying to "poison" the query by also looking in all GFDL images ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec., so not that bad.

3 seconds is a very long time for a query to run. Typical queries should take more like, say, 10 ms. Occasional selects taking three seconds might or might not kill the servers, but they're far from optimal. Also, did you try in a really worst-case scenario, like intersecting "Unprintworthy redirects" with "Stub-Class biography articles" on enwiki? Obviously users aren't likely to legitimately run an intersection of those exact categories (since they're logically disjoint), but you should test this kind of thing to ensure scalability. The query appears to take 16s on your tool.

Again, the only really scalable solution looks to be fulltext search of some kind. We've known for a long time that category intersections can easily be done well enough, for a modest standard of "well enough", but that hasn't been considered good enough to run on Wikipedia.

Magnus Manske

5:40 p.m.

On Tue, Dec 2, 2008 at 2:57 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:

...

On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske magnusmanske@googlemail.com wrote:

...
(feel free to bash me if we had this variant already, I couldn't find it in the list archives)

Task: On German Wikipedia (yay atomic categories!), find women who were born in 1901 and died in 1986.
Runtime: Toolserver, <2 sec
Query:

    SELECT * FROM (
        SELECT page_title, count(cl_to) AS cnt
        FROM page, categorylinks
        WHERE page_id = cl_from
          AND cl_to IN ("Frau", "Geboren_1901", "Gestorben_1986")
        GROUP BY cl_from
    ) AS tbl1
    WHERE tbl1.cnt = 3;

This will fail with a syntax error on the main servers, because subqueries aren't supported in MySQL 4.0. You don't really need the subquery, though; you should be able to just use HAVING:

    SELECT page_title
    FROM page, categorylinks
    WHERE page_id = cl_from
      AND cl_to IN ('Frau', 'Geboren_1901', 'Gestorben_1986')
    GROUP BY cl_from
    HAVING COUNT(cl_to) = 3;

You're right. Fixed in the tool.

...

Your solution requires filesorting the union of the categories, as far as I can tell. I would expect it, offhand, to be significantly slower than a solution using joins:

    SELECT page_title
    FROM page
    JOIN categorylinks AS cl1 ON page_id = cl1.cl_from
    JOIN categorylinks AS cl2 ON page_id = cl2.cl_from
    JOIN categorylinks AS cl3 ON page_id = cl3.cl_from
    WHERE cl1.cl_to = 'Frau'
      AND cl2.cl_to = 'Geboren_1901'
      AND cl3.cl_to = 'Gestorben_1986';

But I haven't benchmarked it, and who knows what kind of execution quirks are happening here.

It seems the JOIN query is significantly faster when all categories are large.

However, with one or more small categories, I can do a pre-selection of pages (get page_ids for the intersection of the small categories, then look only for these in the larger ones), which in turn is significantly faster than the JOIN. My tool now uses the algorithm appropriate for the respective query.

...

...
Trying to "poison" the query by also looking in all GFDL images ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec., so not that bad.

3 seconds is a very long time for a query to run. Typical queries should take more like, say, 10 ms. Occasional selects taking three seconds might or might not kill the servers, but they're far from optimal.

I am uncertain how much the toolserver factors in here. The poor thing is under a lot of stress ;-)

...

Also, did you try in a really worst-case scenario, like intersecting "Unprintworthy redirects" with "Stub-Class biography articles" on enwiki? Obviously users aren't likely to legitimately run an intersection of those exact categories (since they're logically disjoint), but you should test this kind of thing to ensure scalability. The query appears to take 16s on your tool.

I ran it again now, and it falls back to the JOIN solution, taking ~10 sec. As a worst-case scenario, I call that acceptable for the tool.

It might not be acceptable for Wikipedia ATM. We could experiment how this performs on the "real" servers, though.

Also, we could restrict certain queries. We know the category size, and in my approach, we know how many articles are in the "small category" intersection. From there, we could guesstimate the worst-case time, and kill the query, or run it in MySQL slow mode (forgot the correct name) to not stress the servers too much.
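A rough sketch of what such a size-based guard could look like, in Python. The 20,000-page "small" threshold is the one mentioned earlier in this thread; the row budget, the strategy names, and the use of the summed category sizes as a stand-in for worst-case cost are all assumptions made for illustration, not anything the tool is known to do.

```python
SMALL_CATEGORY_LIMIT = 20_000  # "small" threshold quoted earlier in the thread
MAX_ROWS_TO_SCAN = 500_000     # hypothetical budget, not a figure from the thread


def plan_intersection(sizes,
                      small_limit=SMALL_CATEGORY_LIMIT,
                      max_rows=MAX_ROWS_TO_SCAN):
    """Pick a strategy from category sizes alone, before running any query.

    sizes maps category name -> number of member pages.
    Returns one of "empty", "preselect", "join", or "reject".
    """
    if not sizes:
        raise ValueError("need at least one category")
    if min(sizes.values()) == 0:
        # An empty category makes the whole intersection empty.
        return "empty"
    if any(count < small_limit for count in sizes.values()):
        # At least one small category: pre-select its pages, then filter.
        return "preselect"
    if sum(sizes.values()) > max_rows:
        # Crude worst case: every row of every category has to be scanned.
        return "reject"
    return "join"
```

For example, `plan_intersection({"Big_A": 300_000, "Big_B": 400_000})` returns `"reject"` under these made-up limits, while adding any category below the small threshold switches the plan to `"preselect"`. A real deployment would want to calibrate both numbers against measured query times.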

...

Again, the only really scalable solution looks to be fulltext search of some kind. We've known for a long time that category intersections can easily be done well enough, for a modest standard of "well enough", but that hasn't been considered good enough to run on Wikipedia.

No matter what method, I think the problem should get high priority. I currently see a case on Commons, where there's now "Category:Paintings by Vincent van Gogh in this-and-that-museum". It's getting ridiculous (or is already there).

Magnus

P.S.: Just got a message on my Commons talk page about the "+incategory:" search function.
* This is based on the Lucene index, right? How often is that updated?
* Is there a decent interface/special page for that? It's a pain to enter this manually, and I doubt many people know about it.
* Is there a machine-readable interface for this? One that will return 5K hits without screenscraping?

Aryeh Gregor

6:07 p.m.

On Tue, Dec 2, 2008 at 11:40 AM, Magnus Manske magnusmanske@googlemail.com wrote:

...

I am uncertain how much the toolserver factors in here. The poor thing is under a lot of stress ;-)

The query has to scan all of the categorylinks rows for all of the categories you specify, at least in the worst case. That could be a few hundred thousand rows, maybe a million or more if you combine several very large categories. That will take a few seconds even on the real servers, probably (from experience with SELECT COUNT(*) FROM categorylinks WHERE cl_to='Foo' in Special:Category).

...

I ran it again now, and it falls back to the JOIN solution, taking ~10 sec. As a worst-case scenario, I call that acceptable for the tool.

It might not be acceptable for Wikipedia ATM. We could experiment how this performs on the "real" servers, though.

It might be acceptable if it's not run too often, it just wouldn't be ideal. We're not talking about running such queries on every page view, I assume, so it shouldn't be the end of the world. It would be good to get a more efficient way, but the important thing is for someone to actually get something in the core software period, IMO. We have any number of toolserver tools to do this, probably at least five, but that's not going to get us progress.

...

Also, we could restrict certain queries. We know the category size, and in my approach, we know how many articles are in the "small category" intersection. From there, we could guesstimate the worst-case time, and kill the query, or run it in MySQL slow mode (forgot the correct name) to not stress the servers too much.

Read uncommitted?

...

No matter what method, I think the problem should get high priority. I currently see a case on Commons, where there's now "Category:Paintings by Vincent van Gogh in this-and-that-museum". It's getting ridiculous (or is already there).

Lots of things should get high priority and don't. Look at how sorting on category pages is completely broken for a lot of languages, for instance, due to sorting in code point order. Someone with commit access has to spend the time to fix it, is all.

...

P.S.: Just got a message on my Commons talk page about the "+incategory:" search function.

That doesn't include transcluded categories, so it's not a proper solution. I don't know much about it. Lucene would likely be a good choice for a "real" solution, but we'd need to make up a separate table for it, not just use the page text table.

Bryan Tong Minh

6:59 p.m.

On Tue, Dec 2, 2008 at 5:40 PM, Magnus Manske magnusmanske@googlemail.com wrote:

...

Is there a machine-readable interface for this? One that will return 5K hits without screenscraping?

api.php?list=search?
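As a hedged illustration of that suggestion, the Python sketch below only builds a request URL for the MediaWiki API's `list=search` module and does not contact any server. The parameters `action`, `list`, `srsearch`, `srlimit`, `sroffset`, and `format` are standard API parameters; combining several `incategory:` terms with quoted category names in one search string is an assumption about the search syntax, and the helper name is made up.

```python
from urllib.parse import urlencode


def incategory_search_url(api_base, categories, limit=50, offset=0):
    """Build (but do not send) a list=search request for an incategory: query."""
    # One incategory: term per category; quoting handles names with spaces.
    query = " ".join(f'incategory:"{name}"' for name in categories)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,    # results per request
        "sroffset": offset,  # where to continue from when paging
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


if __name__ == "__main__":
    print(incategory_search_url(
        "https://commons.wikimedia.org/w/api.php",
        ["GFDL", "Buildings in Berlin"]))
```

Collecting several thousand hits would mean repeating the request with an increasing `sroffset`. As noted elsewhere in the thread, this search only sees category links written directly in the page wikitext.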

Robert Stojnic

10:59 p.m.

...

P.S.: Just got a message on my Commons talk page about the "+incategory:" search function.

This is based on the Lucene index, right? How often is that updated?

It is updated daily. As already pointed out, it doesn't do transcluded categories, but just looks at Category: links within raw article wikitext.

...

Is there a decent interface/special page for that? It's a pain to enter this manually, and I doubt many people know about it

Nope, no interface. I've pretty much made it just because it was easy to do, and doesn't really take up any significant space in the index. If one dared to make a category intersection frontend it could possibly be useful for testing.

However, as discussed before, making an efficient and easily-integrable-into-WMF-type-setup backend is not exactly straightforward.

Cheers, Robert

Daniel Schwen

5:01 p.m.

New subject: The never-dying topic: category intersection (been there done that)

...

    SELECT * FROM (
        SELECT page_title, count(cl_to) AS cnt
        FROM page, categorylinks
        WHERE page_id = cl_from
          AND cl_to IN ("Frau", "Geboren_1901", "Gestorben_1986")
        GROUP BY cl_from
    ) AS tbl1
    WHERE tbl1.cnt = 3;

Mh, yeah, that is pretty much the same idea that my http://toolserver.org/~dschwen/intersection/ uses

Except that I'm using several queries into a temporary table instead of assembling one query with subqueries (plus it also supports link-intersection).

But then again, my tool supports deep indexing, which _needs_ multiple queries. So I opted for flexibility here.
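To make the "several queries into a temporary table" idea concrete, here is a minimal Python sketch using an in-memory SQLite database. It is only one guess at how such a scheme could work: the `hits` table, the function names, and the sample rows are hypothetical, and the actual tool may be organised quite differently.

```python
import sqlite3


def make_db():
    """Tiny stand-in for page/categorylinks with made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER NOT NULL, cl_to TEXT NOT NULL);
    """)
    sample = {
        1: ("File_A", ["GFDL", "Buildings_in_Berlin", "Night"]),
        2: ("File_B", ["Buildings_in_Berlin", "Night"]),
        3: ("File_C", ["GFDL"]),
        4: ("File_D", ["GFDL", "Buildings_in_Berlin"]),
    }
    for page_id, (title, cats) in sample.items():
        db.execute("INSERT INTO page VALUES (?, ?)", (page_id, title))
        db.executemany("INSERT INTO categorylinks VALUES (?, ?)",
                       [(page_id, cat) for cat in cats])
    return db


def intersect_with_temp_table(db, cats):
    """Fill a temporary table from the first category, then prune it step by step."""
    db.execute("CREATE TEMP TABLE IF NOT EXISTS hits (page_id INTEGER PRIMARY KEY)")
    db.execute("DELETE FROM hits")  # start clean if the table is being reused
    first, *rest = cats
    db.execute(
        "INSERT INTO hits SELECT DISTINCT cl_from FROM categorylinks WHERE cl_to = ?",
        (first,))
    for cat in rest:
        # Each step is an independent query, so it could just as well be a
        # link lookup or a walk through subcategories.
        db.execute(
            "DELETE FROM hits WHERE page_id NOT IN "
            "(SELECT cl_from FROM categorylinks WHERE cl_to = ?)", (cat,))
    rows = db.execute(
        "SELECT page_title FROM page JOIN hits USING (page_id) ORDER BY page_title")
    return [r[0] for r in rows]
```

The appeal of this layout is that every pruning step stands alone, which is what allows mixing category membership with other criteria; the price is several round trips instead of one query.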

...

Task: On German Wikipedia (yay atomic categories!), find women who

Yeah this is all nice and fine, but we've discussed this issue ad nauseam:

* Atomic categories = _trivial_ intersection
* Non-atomic categories = total bullshit that makes me vomit (sorry guys!)

I find it a little frustrating that this wheel gets reinvented so often. My tool was used a couple of times after I posted it, and now has maybe one user per day (from a quick glance at the logs). What is going on here? I'm starting to think that nobody actually gives a damn about category intersection, except for a couple of vocal people on the mailing list. And out of these only a fraction actually _works_ on the problem.

So we have shown multiple times now that cat intersection is technically feasible. What we need now is massive lobbying for atomic categorisation. THAT is the hurdle right now IMO. Not some SQL queries.

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

Aryeh Gregor

5:12 p.m.

New subject: The never-dying topic: category intersection (been there done that)

On Tue, Dec 2, 2008 at 11:01 AM, Daniel Schwen lists@schwen.de wrote:

...

So we have shown multiple times now that cat intersection is technically feasible. What we need now is massive lobbying for atomic categorisation. THAT is the hurdle right now IMO. Not some SQL queries.

I'd say that what we need is someone to add proper support for this to the core software and get it enabled on Wikimedia sites, actually. A toolserver tool is just not the same as having the feature integrated into the software, in terms of usage levels. It might be that the implementations written so far are not efficient enough for enabling on Wikimedia, but nobody with commit access has even tried.

Daniel Schwen

5:20 p.m.

New subject: The never-dying topic: category intersection (been there done that)

...

I'd say that what we need is someone to add proper support for this to the core software and get it enabled on Wikimedia sites, actually. A

Then I suggest that Magnus immediately stops working on it, or else his curse of never getting anything into the core might strike ;-)

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

Aryeh Gregor

5:33 p.m.

New subject: The never-dying topic: category intersection (been there done that)

On Tue, Dec 2, 2008 at 11:20 AM, Daniel Schwen lists@schwen.de wrote:

...

Then I suggest that Magnus immediately stops working on it, or else his curse of never getting anything into the core might strike ;-)

Doesn't he have commit access?

Daniel Schwen

5:41 p.m.

New subject: The never-dying topic: category intersection (been there done that)

...

...
curse of never getting anything into the core might strike ;-)

Doesn't he have commit access?

True, but his gems never seem to get enabled...

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

Magnus Manske

5:43 p.m.

New subject: The never-dying topic: category intersection (been there done that)

On Tue, Dec 2, 2008 at 4:33 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:

...

On Tue, Dec 2, 2008 at 11:20 AM, Daniel Schwen lists@schwen.de wrote:

...
Then I suggest that Magnus immediately stops working on it, or else his curse of never getting anything into the core might strike ;-)

Doesn't he have commit access?

Yes, but Brion has revert access :-)

Aerik

3 Dec

6:49 p.m.

New subject: [Wikitech-l] The never-dying topic: category intersection (been there done that)

Aryeh Gregor <Simetrical+wikilist@...> writes:

...

On Tue, Dec 2, 2008 at 11:01 AM, Daniel Schwen <lists@...> wrote:

...
So we have shown multiple times now that cat intersection is technically feasible. What we need now is massive lobbying for atomic categorisation. THAT is the hurdle right now IMO. Not some SQL queries.

I'd say that what we need is someone to add proper support for this to the core software and get it enabled on Wikimedia sites, actually. A toolserver tool is just not the same as having the feature integrated into the software, in terms of usage levels. It might be that the implementations written so far are not efficient enough for enabling on Wikimedia, but nobody with commit access has even tried.

I'm with you - we've shown feasibility in large datasets with a lucene based approach, and I think we need to roll it out and test it with real users on real data. We need a new lucene index and a user interface (needs to be defined) suitable for average users to find useful. I'm thinking of a "browse related categories" type of function.

Best Regards, Aerik

David Gerard

6:58 p.m.

New subject: The never-dying topic: category intersection (been there done that)

2008/12/3 Aerik aerik@thesylvans.com:

...

I'm with you - we've shown feasibility in large datasets with a lucene based approach, and I think we need to roll it out and test it with real users on real data. We need a new lucene index and a user interface (needs to be defined) suitable for average users to find useful. I'm thinking of a "browse related categories" type of function.

Write something the Commons cabal(tm) will love and you'll be most rewarded with joy and happy users and stuff.

- d.

Lars Aronsson

2 Dec

5:59 p.m.

New subject: The never-dying topic: category intersection (been there done that)

Daniel Schwen wrote:

...

I find it a little frustrating that this wheel gets reinvented so often. My tool was used a couple of times after I posted it, and now has maybe one user per day (from a quick glance at the

Users of the Swedish Wikipedia are increasingly starting to use Duesentrieb's CatScan tool. It is really useful, but could use some further improvement, especially in the handling of large categories.

...

So we have shown multiple times now that cat intersection is technically feasible. What we need now is massive lobbying for atomic categorisation. THAT is the hurdle right now IMO. Not some SQL queries.

After a lengthy discussion (over many years) about category:tennis players and category:female tennis players in the Swedish Wikipedia, I created in late August 2008 the category:men and category:women, so that all profession categories could be freed from the burden of also documenting the gender. The Swedish Wikipedia still has a category:Danish tennis players (combining profession and nationality), just like the English Wikipedia, but gender is now documented separately, as in the German Wikipedia.

All three languages have a category:1942 births. I think no language of Wikipedia has a combined category for tennis players born in 1942. So the question of atomic categories is not an absolute. It is more or less implemented everywhere. For finding tennis players born in 1942, even the English Wikipedia needs to do cross sectioning of categories.

Radically changing the categorization system is not realistic. It was a huge effort already to introduce men/women in the Swedish Wikipedia, even though this was just adding categories (not removing any), and even though Swedish is not among the largest 10 Wikipedias. Within 3 months (September-November), some 75,000 articles were categorized, of which 15,000 women and 60,000 men. The ratio 1:4 (1 woman for every 4 men) is far more equal than the 1:6 ratio of the German Wikipedia.

What I discovered then was that of these 75,000 biographies, only 60,000 were categorized according to year of birth. So we now have to birth categorize 15,000 articles before we can compile reliable statistics on how the gender imbalance shifts over time. Early estimates show that there is a 1:10 gender ratio in the 18th century and a 1:3 ratio for those born in the 1970s.

So the larger imbalance (1:6) of the German Wikipedia might be explained by having a larger amount of 18th century biographies.

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

Daniel Schwen

10:37 p.m.

New subject: The never-dying topic: category intersection (been there done that)

...

born in 1942. So the question of atomic categories is not an absolute. It is more or less implemented everywhere. For finding

I'm going to stop you right here! One word: 'commons'

'Nuff said.

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

Nikola Smolenski

6:05 p.m.

New subject: The never-dying topic: category intersection (been there done that)

On Tuesday 02 December 2008 17:01:30 Daniel Schwen wrote:

...

I find it a little frustrating that this wheel gets reinvented so often. My tool was used a couple of times after I posted it, and now has maybe one user per day (from a quick glance at the logs). What is going on here? I'm starting to think that nobody actually gives a damn about category intersection, except for a couple of vocal people on the mailing list. And out of these only a fraction actually _works_ on the problem.

Perhaps just category intersection isn't enough. I was thinking about a tool that would allow intersection of various article data, including the categories.

For example, suppose that I am maintaining [[1991 in art]]. I would want to find all articles that link to [[1991]] and are in a subcategory of [[:Category:Visual arts]]. And don't even get me started about what could be done if template parameters were recorded somewhere...

Daniel Schwen

6:24 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of two)

...

Perhaps just category intersection isn't enough. I was thinking about a tool that would allow intersection of various article data, including the categories.

For example, suppose that I am maintaining [[1991 in art]]. I would want to find all articles that link to [[1991]] and are in a subcategory of [[:Category:Visual arts]]. And don't even get me started about what could be done if template parameters were recorded somewhere...

Is anybody actually reading what I write? Skimming apparently does not cut it!

Go to http://toolserver.org/~dschwen/intersection/

It does precisely that. Category-intersection plus Link-intersection. Also see:

http://en.wikipedia.org/wiki/Wikipedia:Link_intersection http://en.wikipedia.org/wiki/Wikipedia:Category_intersection

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

Roan Kattouw

3 Dec

4:48 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

We had a pretty lengthy discussion about this before the summer, and the consensus seemed to be that a fulltext-based approach looked most viable. I actually wrote an extension that does that, and promised to release it soon; that was quite a few months ago, and I never got around to it. I'll release it properly when I have time, which will hopefully be before Christmas :D

The code needs some tweaking and refactoring, though. It's pretty tightly integrated with the article text search (both functions in one form) and has all kinds of weird features, because the guy who paid me to write it wanted them. It also doesn't support three-letter word searching (which core does these days, using a prefix hack), which is pretty bad since categories with short titles (or stopword titles) won't be found either.

Roan Kattouw (Catrope)

Daniel Schwen

4:59 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

...

We had a pretty lengthy discussion about this before the summer, and the consensus seemed to be that a fulltext-based approach looked most viable.

So how does this take care of deep indexing non-atomic categories? =>How will this extension be even remotely useful for let's say commons?

This discussion is far from over. The basic problems are _not_ solved.

I'm sure this thread will die out soon. Half of the participants will again be soothed by the promise of some easy solution just barely beyond the horizon, while the half that realizes that said solution _cannot possibly work_ without a radical reform of the category system will again be too annoyed (I'm getting there already) to continue discussing.

Déjà vu...

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

Roan Kattouw

5:05 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

Daniel Schwen schreef:

...

...
We had a pretty lengthy discussion about this before the summer, and the consensus seemed to be that a fulltext-based approach looked most viable.

So how does this take care of deep indexing non-atomic categories?

Err.. what? Please explain what you mean by that.

...

=>How will this extension be even remotely useful for let's say commons?

Without addressing Commons in particular, having an efficient way to get pages in the intersection of multiple categories would allow wikis to delete a category such as [[Category:Deceased Presidents of the United States]] and replace it by, say, [[Intersection:Deceased Presidents of the United States]], which would list all articles in [[Category:Deceased people]] and [[Category:Presidents of the United States]]. My extension alone doesn't make that possible, but it makes implementing such a feature considerably easier.

...

This discussion is far from over. The basic problems are _not_ solved.

Would you care to elaborate on what those unsolved problems are?

...

I'm sure this thread will die out soon. Half of the participants will again be soothed by the promise of some easy solution just barely beyond the horizon, while the half that realizes that said solution _cannot possibly work_ without a radical reform of the category system will again be too annoyed (I'm getting there already) to continue discussing.

It would be nice if you didn't judge people as naive right away.

Roan Kattouw (Catrope)

Gregory Maxwell

5:31 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

On Wed, Dec 3, 2008 at 11:05 AM, Roan Kattouw roan.kattouw@home.nl wrote:

...

Without addressing Commons in particular, having an efficient way to get pages in the intersection of multiple categories would allow wikis to delete a category such as [[Category:Deceased Presidents of the United States]] and replace it by, say, [[Intersection:Deceased Presidents of the United States]], which would list all articles in [[Category:Deceased people]] and [[Category:Presidents of the United States]]. My extension alone doesn't make that possible, but it makes implementing such a feature considerably easier.

[snip]

We've had tools like this on toolserver before, with decent performance and the ability to be embedded into commons via cross site JS hacks, and been told in no uncertain terms that the community policy is "do not over categorize; things should be placed in the fewest and most specific categories possible". On commons there are quite a few contributors who spend all of their time converting the set of categories on an image to the one or two most specific categories.

Please pardon Dschwen's frustration: it seems like people are constantly waving their arms and saying that there will be some wonderful technical solution right around the corner for the problems created by the current categorization approach (never mind that some of them, such as the extreme semantic drift, are unsolvable with a technical solution).

For commons, and to a lesser degree other projects, the limiting factor in the usability of an intersection tool is less the lack of one and more the insistence of the userbase on using categories in a manner which is generally incompatible with them.

For the purposes of MediaWiki these factors are not important, I suppose, but it does explain the sceptical response.

David Gerard

5:06 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

2008/12/3 Daniel Schwen lists@schwen.de:

...

I'm sure this thread will die out soon. Half of the participants will again be soothed by the promise of some easy solution just barely beyond the horizon, while the half that realizes that said solution _cannot possibly work_ without a radical reform of the category system will again be too annoyed (I'm getting there already) to continue discussing.

If the machinery is in place to replace the present ridiculous sub-sub-sub-categories with something that *does their job just as well*, they'll die in quite reasonable order.

If the machinery can't completely replace them without editor pain, it'll fail. If it can, it won't and Commons will be ENORMOUSLY happy 'cos we can then go wild treating cats like tags!

- d.

Aryeh Gregor

5:13 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

On Wed, Dec 3, 2008 at 10:59 AM, Daniel Schwen lists@schwen.de wrote:

...

So how does this take care of deep indexing non-atomic categories? =>How will this extension be even remotely useful for let's say commons?

That's a social problem, and so of secondary importance. Once a technical mechanism exists for solving the problem given a particular type of categories, recategorization will happen, sooner or later. If you think people will flat-out refuse to move to a new, better system, I think you're mistaken: look at the completeness of the move from lists to categories, for instance, when categories were first introduced. (Lists are still used, but in most cases only where they do things that categories currently cannot.) The same goes for all the other useful technical innovations that get introduced. All it would take is running some bots for a while to switch to the better system, not a big cost for a large wiki like Commons with plenty of bot operators.

On a technical level, dealing with non-atomic categories is a much bigger pain than dealing with atomic ones. On a social level, on the other hand, they're equally doable, as dewiki shows. There will be transition costs for wikis that have a large body of non-atomic categories, but those will be one-time only.

Daniel Schwen

5:43 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

...

the other useful technical innovations that get introduced. All it would take is running some bots for a while to switch to the better system, not a big cost for a large wiki like Commons with plenty of bot operators.

I'd like for you to be right. But switching from the present category system to atomic categories is not as straightforward as having a few bots run over all existing cats.

It will require an enormous amount of work. And so far I have not met willingness to change anything. Greg has shown a long time ago that fast category intersection is doable, but the echo has been pretty much zip, nada.

Just note that simply replacing a category with all of its supercategories is a dead end. You wouldn't believe the twists and turns in the category tree. Amusing examples have been posted on this list already.

So, yeah, sorry for my tone. I've pretty much kept my cool for the last N incarnations of this debate, but after repeating all the arguments for atomic cats and intersections and seeing zero improvement I'm getting a little frustrated. Call it "empirical evidence" rather than "assuming people to be naive" ;-)

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

Aryeh Gregor

4 Dec

12:35 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

On Wed, Dec 3, 2008 at 11:43 AM, Daniel Schwen lists@schwen.de wrote:

...

I'd like for you to be right. But switching from the present category system to atomic categories is not as straightforward as having a few bots run over all existing cats.

Of course, humans would have to manually specify which new categories each old one corresponds to, but that's a perfectly doable job for a small group of volunteers working over the course of months. The bots would do the much more tedious work of actually replacing them, so each category could take substantially less than a minute of human review. The category intersection feature would then get incrementally more useful as the work progressed.
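A minimal sketch of that division of labour, assuming a hand-written mapping from each old compound category to the atomic categories that replace it. The mapping entries and the function name are hypothetical illustrations, not an existing bot or MediaWiki API.

```python
# Hypothetical mapping maintained by human reviewers:
# old compound category -> set of atomic categories replacing it.
ATOMIC_MAP = {
    "Buildings in Berlin": {"Buildings", "Berlin"},
    "Churches in Berlin": {"Churches", "Buildings", "Berlin"},
}


def recategorize(page_categories, mapping):
    """Return a page's categories with every mapped compound category
    replaced by its atomic equivalents; unmapped categories are kept."""
    result = set()
    for cat in page_categories:
        result |= mapping.get(cat, {cat})
    return result


if __name__ == "__main__":
    # The bot's part: apply the reviewed mapping to one page.
    print(sorted(recategorize({"Churches in Berlin", "GFDL"}, ATOMIC_MAP)))
```

Because unmapped categories pass through unchanged, such a bot could run repeatedly while the mapping is still incomplete, which matches the "incrementally more useful" point.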

...

It will require an enormous amount of work. And so far I have not met willingness to change anything. Greg has shown a long time ago that fast category intersection is doable, but the echo has been pretty much zip, nada.

There's a world of difference between showing that something is feasible in theory, and making it a core part of the software that's visible on every category page on every Wikimedia wiki without asking for community consensus in advance. As soon as people actually start using the feature, and they will if there's a box on every category page, they'll realize that it would be way more useful if they changed how things are categorized. As long as category intersections remain vaporware, there's no incentive to change. A technical fait accompli will bring about change.

Even if Commons hypothetically didn't go along with the scheme, it would be valuable to have it in the software anyway. Plenty of wikis could still use it, like dewiki. We need an interface and we need a backend and we need someone to hook them together and commit them to Subversion. People have spent too much time inventing and reinventing and re-reinventing new and different but basically interchangeable backends, and too little time on the other parts of the problem. If the feature were committed to the software with a completely brainless backend unusable on Wikimedia wikis, I predict it would be live on all sites in less than six months.

Daniel Schwen

1:12 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

...

how things are categorized. As long as category intersections remain vaporware, there's no incentive to change. A technical fait accompli will bring about change.

Uhm, yeah.. except that intersections of atomic categories are not vaporware. We had proofs of concept for that and the interest was marginal.

In any case. If someone really just shoved it into mw core and enabled it on all the wmf sites I'd be happy. I concur that it would make the job of convincing users of a less retarded categorization scheme a bit easier.

As far as Aerik's soapboxing from a few emails back goes: Let's not kid ourselves, tag-based categorization is standard on commercial sites such as stock photography libraries. We are not exactly inventing this...

I'll shut up now, and I really hope that this is the last time we're having this discussion... (but boy, you will get an earful if it isn't ;-) )

-- [[en:User:Dschwen]] [[de:Benutzer:Dschwen]] [[commons:User:Dschwen]]

David Gerard

2:12 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

2008/12/4 Daniel Schwen lists@schwen.de:

...

...
how things are categorized. As long as category intersections remain vaporware, there's no incentive to change. A technical fait accompli will bring about change.

...

Uhm, yeah.. except that intersections of atomic categories are not vaporware. We had proofs of concept for that and the interest was marginal.

It's vaporware until it's usable as a tagging system in practice.

...

In any case. If someone really just shoved it into mw core and enabled it on all the wmf sites I'd be happy. I concur that it would make the job of convincing users of a less retarded categorization scheme a bit easier. As far as Aerik's soapboxing from a few emails back goes: Let's not kid ourselves, tag-based categorization is standard on commercial sites such as stock photography libraries. We are not exactly inventing this...

This being precisely what Commons has been begging for for a while!

...

I'll shut up now, and I really hope that this is the last time we're having this discussion... (but boy, you will get an earful if it isn't ;-) )

The last time will be when there's a feature end-users can use without going off to the toolserver.

- d.

Gregory Maxwell

2:16 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

On Wed, Dec 3, 2008 at 8:12 PM, David Gerard dgerard@gmail.com wrote:

...

The last time will be when there's a feature end-users can use without going off to the toolserver.

With a JS hack I had my tool integrated to the site. The AJAX calls went to the toolserver, but as far as the users could see it was running on the site. No one cared: It didn't produce useful results because of how categories are used, and when I suggested changing people just waved their arms at me "just make it walk the tree".

David Gerard

2:22 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

2008/12/4 Gregory Maxwell gmaxwell@gmail.com:

...

On Wed, Dec 3, 2008 at 8:12 PM, David Gerard dgerard@gmail.com wrote:

...

...
The last time will be when there's a feature end-users can use without going off to the toolserver.

...

With a JS hack I had my tool integrated to the site. The AJAX calls went to the toolserver, but as far as the users could see it was running on the site. No one cared: It didn't produce useful results because of how categories are used, and when I suggested changing people just waved their arms at me "just make it walk the tree".

Hmm, I musta missed this. I woulda thought the commons-l habitues would have swooped upon it with great glee.

- d.

Ilmari Karonen

8:15 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

Gregory Maxwell wrote:

...

With a JS hack I had my tool integrated to the site. The AJAX calls went to the toolserver, but as far as the users could see it was running on the site. No one cared: It didn't produce useful results because of how categories are used, and when I suggested changing people just waved their arms at me "just make it walk the tree".

That _is_ curious. When did this happen? It seems I also blinked and missed it.

-- Ilmari Karonen

Alex

8:29 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

Gregory Maxwell wrote:

...

On Wed, Dec 3, 2008 at 8:12 PM, David Gerard dgerard@gmail.com wrote:

...
The last time will be when there's a feature end-users can use without going off to the toolserver.

With a JS hack I had my tool integrated to the site. The AJAX calls went to the toolserver, but as far as the users could see it was running on the site. No one cared: It didn't produce useful results because of how categories are used, and when I suggested changing people just waved their arms at me "just make it walk the tree".

It's sort of a cycle we're stuck in. There's not much interest in developing a good category intersection tool for core because the category system on the larger Wikimedia wikis won't really work well with it. If we develop it there's the risk of the response being the same as to yours, basically: "Why should we change all the categories? Just change the tool."

And there's no incentive to change the category system until we actually have a category intersection tool in core. If people actually do it there's the risk that an intersection tool is still a long way off and we're stuck with less-useful categories (though I personally find the current system, at least on enwiki, to be mostly useless).

-- Alex (wikipedia:en:User:Mr.Z-man)

Aryeh Gregor

5 Dec

12:55 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

On Wed, Dec 3, 2008 at 7:12 PM, Daniel Schwen lists@schwen.de wrote:

...

Uhm, yeah.. except that intersections of atomic categories are not vaporware. We had proofs of concept for that and the interest was marginal.

Vaporware with proofs of concept is still vaporware. The definition of vaporware is more or less something that doesn't go *beyond* proofs of concept. Category intersection has never been added to the software and there's no timetable for adding it to the software, so doing any recategorization right *now* to aid category intersection would be pointless. JS thingies may have been enabled on some wikis for some time periods, but that's very different from a feature being prominently added to *all* wikis.

On Wed, Dec 3, 2008 at 8:16 PM, Gregory Maxwell gmaxwell@gmail.com wrote:

...

With a JS hack I had my tool integrated to the site. The AJAX calls went to the toolserver, but as far as the users could see it was running on the site. No one cared: It didn't produce useful results because of how categories are used, and when I suggested changing people just waved their arms at me "just make it walk the tree".

What was the interface like (how noticeable/obtrusive), how long was it up, and why did it get removed? You're certainly going to need a critical mass of people who know about it and use it before there will be any effect. And enabling it on all wikis at once would likely help, too: if Germans get used to using it on dewiki and find it useful, they'll be more likely to push for it to be made useful on Commons.

On Thu, Dec 4, 2008 at 7:45 AM, Tim Landscheidt tim@tim-landscheidt.de wrote:

...

Add to that the maintenance costs because you would want to ensure that if someone who is not aware of the concept of atomic categories adds a [[Category:Manhattan]] to something he adds [[Category:New York]], [[Category:East Coast of the United States]], [[Category:United States]] and the other gigazillion umbrella categories as well so searches for a building in a country bordering a water body will still show results.

A reasonable point. In the medium term it could be handled by (you guessed it) bots. In the longer term, allowing people to define more concrete semantic relationships between categories (e.g., "X is partitioned into X1, X2, ..., Xn") could make this automatic within the software itself.
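One way such an "X is partitioned into X1, X2, ..., Xn" relationship could be used is at query time: a search for X is expanded to X plus all of its declared parts, so nobody has to add the umbrella categories by hand. The sketch below is only an illustration under that assumption; the data, the dictionary layout and the function names are hypothetical and not part of MediaWiki.

```python
# Hypothetical declarations: category -> the parts it is partitioned into.
PARTITIONS = {
    "United States": ["New York", "California"],
    "New York": ["Manhattan", "Brooklyn"],
}

# Hypothetical direct memberships: category -> pages tagged with it.
MEMBERS = {
    "Manhattan": {"Chrysler Building"},
    "California": {"Golden Gate Bridge"},
}


def expand(category, partitions):
    """Return the category plus everything reachable through partitions."""
    seen = set()
    stack = [category]
    while stack:
        cat = stack.pop()
        if cat in seen:  # guards against accidental cycles
            continue
        seen.add(cat)
        stack.extend(partitions.get(cat, []))
    return seen


def pages_in(category, partitions, members):
    """Pages tagged with the category or with any of its declared parts."""
    pages = set()
    for cat in expand(category, partitions):
        pages |= members.get(cat, set())
    return pages


if __name__ == "__main__":
    print(sorted(pages_in("United States", PARTITIONS, MEMBERS)))
```

Only relationships that were explicitly declared as partitions are followed, which is the difference from blindly walking the existing category tree.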

In the end, all of these objections are really irrelevant to the technical issues here. The fact of the matter is that category intersection is widely supported in other major software products (in the form of tag intersection), it's something that a lot of people want, and so it would be good if it were in the core software. How fully various specific communities would want to use it is up to them -- that some communities might never choose to use a particular feature doesn't mean that it shouldn't be developed (cf. FlaggedRevs, etc.).

David Gerard

1:02 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

2008/12/4 Aryeh Gregor Simetrical+wikilist@gmail.com:

...

On Wed, Dec 3, 2008 at 8:16 PM, Gregory Maxwell gmaxwell@gmail.com wrote:

...

...
With a JS hack I had my tool integrated to the site. The AJAX calls went to the toolserver, but as far as the users could see it was running on the site. No one cared: It didn't produce useful results because of how categories are used, and when I suggested changing people just waved their arms at me "just make it walk the tree".

...

What was the interface like (how noticeable/obtrusive), how long was it up, and why did it get removed? You're certainly going to need a critical mass of people who know about it and use it before there will be any effect.

Evidently at least two of us who were drooling for this feature failed to become aware of it ...

...

And enabling it on all wikis at once would likely help, too: if Germans get used to using it on dewiki and find it useful, they'll be more likely to push for it to be made useful on Commons.

oooooooooh. How to hack the Wikimedia social structure.

(mind you, I'll believe it's a conclusive solution when flagged revs hit en:wp.)

...

In the end, all of these objections are really irrelevant to the technical issues here. The fact of the matter is that category intersection is widely supported in other major software products (in the form of tag intersection), it's something that a lot of people want, and so it would be good if it were in the core software. How fully various specific communities would want to use it is up to them -- that some communities might never choose to use a particular feature doesn't mean that it shouldn't be developed (cf. FlaggedRevs, etc.).

Indeed.

Greg, can your thingummy please be switched on again and publicised as such on commons-l, if that's not impossible?

- d.

Tim Landscheidt

4 Dec

1:45 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote:

...

...
I'd like for you to be right. But switching from the present category system to atomic categories is not as straightforward as having a few bots run over all existing cats.

...

Of course, humans would have to manually specify which new categories each old one corresponds to, but that's a perfectly doable job for a small group of volunteers working over the course of months. The bots would do the much more tedious work of actually replacing them, so each category could take substantially less than a minute of human review. The category intersection feature would then get incrementally more useful as the work progressed. [...]

Add to that the maintenance costs because you would want to ensure that if someone who is not aware of the concept of atomic categories adds a [[Category:Manhattan]] to something he adds [[Category:New York]], [[Category:East Coast of the United States]], [[Category:United States]] and the other gigazillion umbrella categories as well so searches for a building in a country bordering a water body will still show results.

Tim

David Gerard

1:49 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

2008/12/4 Tim Landscheidt tim@tim-landscheidt.de:

...

Add to that the maintenance costs because you would want to ensure that if someone who is not aware of the concept of atomic categories adds a [[Category:Manhattan]] to something he adds [[Category:New York]], [[Category:East Coast of the United States]], [[Category:United States]] and the other gigazillion umbrella categories as well so searches for a building in a country bordering a water body will still show results.

Which is why we have zillions of obsessive nerdy humans writing the encyclopedia. Tags are fine, there's nothing wrong intrinsically with hundreds of tags where appropriate and useful. I suppose presentation in Monobook will be interesting ...

- d.

Platonides

6 Dec

1:25 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

David Gerard wrote:

...

2008/12/4 Tim Landscheidt:

...
Add to that the maintenance costs because you would want to ensure that if someone who is not aware of the concept of atomic categories adds a [[Category:Manhattan]] to something he adds [[Category:New York]], [[Category:East Coast of the United States]], [[Category:United States]] and the other gigazillion umbrella categories as well so searches for a building in a country bordering a water body will still show results.

Which is why we have zillions of obsessive nerdy humans writing the encyclopedia. Tags are fine, there's nothing wrong intrinsically with hundreds of tags where appropriate and useful. I suppose presentation in Monobook will be interesting ...

d.

If we're going to end up with hundreds of categories on each page, why not make the software automatically add all parent categories? It would fill the categorylinks table*, but so would adding them manually. It would also require forcing the categories to be a graph and maybe limiting the number of parent categories, so as to reduce somewhat how expensive category position changes can be. But, if we leave that to 'manual actions', the same actions would be performed by bots, leading to the same cost and a partially less coherent structure.

*Add an expandedcategorylinks table? Probably also add a 'don't inherit' flag on the category table which can be applied to high-level categories such as 'All licenses' or 'Commons root'.
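A rough sketch of how the rows of such an expanded table could be computed for one page, assuming a child-to-parents map and a set of categories carrying the 'don't inherit' flag. The names and data are hypothetical; in this sketch a flagged category is neither inherited nor walked through.

```python
def expanded_links(page_cats, parents, no_inherit):
    """Return the page's direct categories plus every inherited ancestor.

    parents:    dict mapping a category to its parent categories
    no_inherit: categories that must never be added by inheritance
    """
    expanded = set(page_cats)
    stack = list(page_cats)
    while stack:
        cat = stack.pop()
        for parent in parents.get(cat, ()):
            # Skip flagged categories and anything already collected
            # (the latter also makes cycles harmless).
            if parent in no_inherit or parent in expanded:
                continue
            expanded.add(parent)
            stack.append(parent)
    return expanded


if __name__ == "__main__":
    parents = {
        "Manhattan": ["New York"],
        "New York": ["United States"],
        "United States": ["Commons root"],
    }
    print(sorted(expanded_links({"Manhattan"}, parents, {"Commons root"})))
```

Moving a category to a different parent would mean recomputing this set for every page underneath it, which is the cost the limit on parent categories is meant to contain.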

Gregory Maxwell

5:21 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

On Sat, Dec 6, 2008 at 7:25 AM, Platonides Platonides@gmail.com wrote:

...

If we're going to end up with hundreds of categories on each page, why not make the software automatically add all parent categories? It would fill the categorylinks table*, but so would adding them manually. It would also require forcing the categories to be a graph and maybe limiting the number of parent categories, so as to reduce somewhat how expensive category position changes can be. But, if we leave that to 'manual actions', the same actions would be performed by bots, leading to the same cost and a partially less coherent structure.

[snip]

Because adding the parents produces nonsense results: "categorization" is a flawed concept except at the most fuzzy and coarse levels. Reality doesn't fit into neat nested boxes (not even the N-dimensional ones created by multiple parentage). The two primary problems are semantic drift (the further away you get from a relationship, the more not-quite-matching error accumulates) and multiple link types (we use categories to describe different types of membership, and while the membership relation is transitive within a type, across types it usually is not). So with parentages you get chains like [periodic table]->[hydrogen]->[hydrogen compounds]->[water]->[places with water]->[beaches]->[beaches in america]->[beaches of lalaville]->[lalavill beach]->[Image:Ironmeteor_at_lalavill_beach.jpg]

Is an iron meteor a "beach in america" or a "hydrogen compound"? No.

Offering all the parents with an easy checkbox interface that allows you to quickly adopt all that apply would be great, but forcing their inclusion would produce rubbish.
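To make the chain above concrete, here is a small sketch that walks the parents of the example image and lists what forced inclusion would attach to it. The chain is the made-up one from this message; the child-to-parent dictionary and the function name are only illustrative.

```python
# The example chain from the message, ordered from root to leaf.
CHAIN = [
    "periodic table", "hydrogen", "hydrogen compounds", "water",
    "places with water", "beaches", "beaches in america",
    "beaches of lalaville", "lalavill beach",
    "Image:Ironmeteor_at_lalavill_beach.jpg",
]

# child -> parent, built from consecutive pairs of the chain.
PARENT = {child: parent for parent, child in zip(CHAIN, CHAIN[1:])}


def ancestors(node, parent):
    """Return every ancestor of node, nearest first."""
    found = []
    while node in parent:
        node = parent[node]
        found.append(node)
    return found


if __name__ == "__main__":
    image = CHAIN[-1]
    for depth, cat in enumerate(ancestors(image, PARENT), start=1):
        # Each step is locally plausible; the accumulated result is not.
        print(depth, cat)
```

Every single link is defensible on its own, yet the walk files the meteor photo under "hydrogen compounds", which is why the ancestors are better offered as suggestions than applied automatically.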

Ilmari Karonen

8 Dec

11:55 p.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

Gregory Maxwell wrote:

...

Because adding the parents produces nonsense results: "categorization" is a flawed concept except at the most fuzzy and coarse levels. Reality doesn't fit into neat nested boxes (not even the N-dimensional ones created by multiple parentage). The two primary problems are semantic drift (the further away you get from a relationship, the more not-quite-matching error accumulates) and multiple link types (we use categories to describe different types of membership, and while the membership relation is transitive within a type, across types it usually is not). So with parentages you get chains like [periodic table]->[hydrogen]->[hydrogen compounds]->[water]->[places with water]->[beaches]->[beaches in america]->[beaches of lalaville]->[lalavill beach]->[Image:Ironmeteor_at_lalavill_beach.jpg]

Is an iron meteor a "beach in america" or a "hydrogen compound"? No.

True, but there are _some_ relationships that should always hold. All dogs are animals. All integers are numbers. All places in New York are in the United States. Arguably, any page which is in [[Category:Dogs]] but not in [[Category:Animals]] is a failure of atomic categorization.

Of course, there are also many relationships that _don't_ hold so strictly. Most dogs are pets, but not all. Most places in the United States are in North America, but not all. So, yes, some of the consistency checking will have to be done at least partly manually.

But really, I wouldn't worry about this too much. Sure, having a way to enforce some category relationships would be useful, as would automatically recommending others. But even if we don't implement it immediately in the software, someone will write a bot (or several) to help with it. It won't be perfect, but I wouldn't expect it to be much more broken than the current interlanguage link system, which we consider useful enough to keep deployed despite its numerous failings.
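As a sketch of what such a helper bot might check, assume a hand-maintained table of the strict relationships only ("all dogs are animals"), leaving the "most, but not all" ones out. The rule table and function name below are hypothetical examples, not an existing bot.

```python
# Hypothetical strict rules: every page in the key category must also be
# in each listed category. "Dogs -> Pets" is deliberately absent because
# that relationship does not always hold.
STRICT_IMPLIES = {
    "Dogs": {"Animals"},
    "Integers": {"Numbers"},
}


def missing_categories(page_cats, implies):
    """Return categories required by the strict rules but absent from the page."""
    required = set()
    seen = set()
    stack = list(page_cats)
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue
        seen.add(cat)
        for implied in implies.get(cat, ()):
            required.add(implied)
            stack.append(implied)  # rules may chain
    return required - set(page_cats)


if __name__ == "__main__":
    print(missing_categories({"Dogs", "Pets"}, STRICT_IMPLIES))
```

A bot could add the missing categories itself or merely report them; the softer relationships would still need a human decision.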

(While thinking about this, I thought back to an earlier discussion on this list (or possibly wikien-l, can't remember now) about the fact that there are essentially two types of categories: thematic and taxonomic. For the former, the "tag" model of atomic categorization is quite natural, but the latter would fit much more naturally into a strictly hierarchical model. It might not be an entirely unreasonable idea to formally split the two, perhaps even into separate namespaces, and apply different technical approaches to handling them.)

-- Ilmari Karonen

Platonides

9 Dec

12:47 a.m.

New subject: The never-dying topic: category intersection (been there done that .. to the power of three)

Ilmari Karonen wrote:

...

(While thinking about this, I thought back to an earlier discussion on this list (or possibly wikien-l, can't remember now) about the fact that there are essentially two types of categories: thematic and taxonomic. For the former, the "tag" model of atomic categorization is quite natural, but the latter would fit much more naturally into a strictly hierarchical model. It might not be an entirely unreasonable idea to formally split the two, perhaps even into separate namespaces, and apply different technical approaches to handling them.)

Doing that would probably help in advancing with the system.

Gregory Maxwell

2 Dec

11:49 p.m.

On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske magnusmanske@googlemail.com wrote: [snip]

...

Articles on en.wikipedia with "1905 births" and "1967 deaths" took <0.4 sec. OTOH, looking for images on Commons in "GFDL" and "Buildings in Berlin" took ~2min. Might be the giant GFDL category, or the toolserver, or both. I'll try to fiddle with it some more utilising cat_pages/cat_files.

No. Bleh. The horrible slowness in your results is a result of broken methodology. (2 seconds is unacceptably slow by a factor of 10x, as far as I'm concerned)

Please see: https://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-September/026715.h...

If you go around blaming big categories I will be forced to hunt you down and kill you. The constant mindset of "big categories = slow" results in people building pre-made intersections to reduce category sizes rather than using atomic categories. We can make big categories blindingly fast, but we simply cannot make the recursion needed to get sensible outcomes on pre-made intersections fast.

I had a tool on the toolserver that provided HTML and JSON interfaces for doing queries against your choice of enwp or commons, ... the worst-case results I could get out of it were on the order of ~30ms when using up to 10 categories. I didn't bother to maintain it because I mostly got complaints that it was not useful because it didn't find most things because it couldn't walk the category tree.
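For illustration only (this is not the toolserver tool described above, whose implementation is not shown in this thread): with atomic categories, intersection reduces to intersecting per-category sets of page IDs, and starting from the smallest set keeps the work bounded by that set even when another category is huge. The index contents and function name are hypothetical.

```python
def intersect(categories, index):
    """Return the page IDs that are in every listed category.

    index: dict mapping category name -> set of page IDs (atomic categories).
    """
    sets = sorted((index.get(c, frozenset()) for c in categories), key=len)
    if not sets:
        return set()
    result = set(sets[0])      # start from the smallest category
    for members in sets[1:]:
        result &= members      # cost is bounded by the current result size
        if not result:
            break              # nothing left to match
    return result


if __name__ == "__main__":
    index = {
        "Frau": {1, 2, 3, 4},
        "Geboren_1901": {2, 3, 9},
        "Gestorben_1986": {3, 7},
    }
    print(intersect(["Frau", "Geboren_1901", "Gestorben_1986"], index))
```

No recursion through the category tree is involved, which is exactly why this only gives useful answers when pages carry the atomic categories directly.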

wikitech-l@lists.wikimedia.org

42 comments

participants (16)

Aerik
Alex
Aryeh Gregor
Bryan Tong Minh
Daniel Schwen
David Gerard
Gregory Maxwell
Ilmari Karonen
Lars Aronsson
Magnus Manske
Mohamed Magdy
Nikola Smolenski
Platonides
Roan Kattouw
Robert Stojnic
Tim Landscheidt