[Commons-l] Our categories are broken, very broken

Daniel Schwen lists at schwen.de
Fri Sep 19 12:34:40 UTC 2008


On Friday 19 September 2008 05:00:19 am Maarten Dammers wrote:
> Daniel Schwen wrote:
> > It is currently almost impossible to harvest information from the
> > category tree. Database requests to find common super-categories and
> > perform boolean operations are not feasible, since they run on the order
> > of minutes.
>
> Almost impossible? Take a look at
> http://toolserver.org/~multichill/filtercats.php

Unfortunately you are comparing apples to apple orchards.
Maybe I wasn't making myself clear enough. Finding common super-categories 
is actually not the biggest problem. That direction branches less than going 
_down_ the tree.

Anyhow, I have worked with categories before. Check out this tool for 
the English Wikipedia:
http://toolserver.org/~dschwen/intersection/

Let's say you want an intersection of all "Actors" in "Germany". Just 
intersecting those two categories won't cut it, thanks to the myriad 
subcategories. The tool has to index several levels of subcategories and 
include their content in the intersection. Try it! The default depth of 2 
levels won't be enough in most cases, and the processing time grows 
exponentially with depth.
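
To make the depth problem concrete, here is a minimal sketch in Python. It is 
not the code behind the tool above. It assumes the category graph fits into 
two in-memory dicts, SUBCATS and MEMBERS, and both the dicts and the category 
and page names in them are made up for illustration.

```python
from collections import deque

# Hypothetical toy data: category -> its direct subcategories.
SUBCATS = {
    "Actors": ["Actors by nationality"],
    "Actors by nationality": ["German actors"],
    "Germany": ["German people"],
    "German people": ["German actors"],
}

# Hypothetical toy data: category -> pages filed directly in it.
MEMBERS = {
    "Actors": {"Acting"},
    "German actors": {"Actor A", "Actor B"},
    "Germany": {"Berlin"},
    "German people": {"Actor A", "Scientist C"},
}


def pages_within(root, depth):
    """Collect pages in `root` and in its subcategories up to `depth` levels down."""
    seen = {root}                 # guards against cycles in the category graph
    queue = deque([(root, 0)])
    pages = set()
    while queue:
        cat, level = queue.popleft()
        pages |= MEMBERS.get(cat, set())
        if level < depth:
            for sub in SUBCATS.get(cat, []):
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, level + 1))
    return pages


def intersect(cat_a, cat_b, depth):
    """Intersect the expanded page sets of two categories."""
    return pages_within(cat_a, depth) & pages_within(cat_b, depth)


if __name__ == "__main__":
    for depth in range(3):
        print(depth, sorted(intersect("Actors", "Germany", depth)))
```

With this toy data the intersection stays empty at depth 0 and 1 and only 
finds the two actors at depth 2. Every extra level multiplies the number of 
categories that have to be expanded, which is where the run time goes.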

In short, multichill's super-category lookup works for one image. To perform a 
proper category intersection you'd have to perform a super-category lookup for 
_every_ image on Commons for each intersection!
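
As a rough illustration of that cost, here is a sketch in Python of an 
intersection built on upward lookups. It is not multichill's code. PARENTS 
and IMAGE_CATS are hypothetical in-memory stand-ins for the database, and the 
file and category names are invented.

```python
# Hypothetical toy data: category -> its direct parent categories.
PARENTS = {
    "German actors": ["Actors by nationality", "German people"],
    "Actors by nationality": ["Actors"],
    "German people": ["Germany"],
}

# Hypothetical toy data: image -> categories it is filed in directly.
IMAGE_CATS = {
    "Actor A.jpg": ["German actors"],
    "Berlin.jpg": ["Germany"],
    "Stage.jpg": ["Actors"],
}


def super_categories(image):
    """Walk upward from one image and return every category reached."""
    stack = list(IMAGE_CATS.get(image, []))
    seen = set()
    while stack:
        cat = stack.pop()
        if cat not in seen:
            seen.add(cat)
            stack.extend(PARENTS.get(cat, []))
    return seen


def intersect_by_lookup(cat_a, cat_b):
    """Intersect two categories using one upward lookup per image."""
    lookups = 0
    hits = []
    for image in IMAGE_CATS:
        lookups += 1              # one super-category lookup for each image
        supers = super_categories(image)
        if cat_a in supers and cat_b in supers:
            hits.append(image)
    return hits, lookups


if __name__ == "__main__":
    print(intersect_by_lookup("Actors", "Germany"))
```

A single lookup is cheap, but the number of lookups equals the number of 
images, and it has to be repeated for every new pair of categories.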


