[Commons-l] Our categories are broken, very broken
Daniel Schwen
lists at schwen.de
Fri Sep 19 12:34:40 UTC 2008
On Friday 19 September 2008 05:00:19 am Maarten Dammers wrote:
> Daniel Schwen wrote:
> > It is currently almost impossible to harvest information from the
> > category tree. Database requests to find common super-categories and
> > perform boolean operations are not feasible, since they run on the order
> > of minutes.
>
> Almost impossible? Take a look at
> http://toolserver.org/~multichill/filtercats.php
Unfortunately you are comparing apples to apple orchards.
And maybe I wasn't making myself clear enough. Finding common super-categories
is actually not the biggest problem. That direction branches less than going
_down_ in the tree.
Anyhow, I have worked with categories before. Check out this tool for
the English Wikipedia:
http://toolserver.org/~dschwen/intersection/
Let's say you want an intersection of all "Actors" in "Germany". Just
intersecting those two categories won't cut it, thanks to the myriad
subcategories. The tool has to deep-index several levels of subcategories and
include their content in the intersection. Try it! The default depth of 2
levels won't be enough in most cases, and the processing time goes up
exponentially with depth.
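
To make the deep indexing concrete, here is a minimal sketch of a depth-limited
intersection. It is not the code behind the tool linked above; the function
names, the in-memory dictionaries and the toy category data are all
hypothetical, and a real tool would query the category link tables instead.

```python
from collections import deque

def collect_members(root, subcats, members, max_depth):
    """Gather pages in `root` and in its subcategories down to `max_depth` levels."""
    seen = {root}                      # guards against loops in the category graph
    queue = deque([(root, 0)])
    pages = set()
    while queue:
        cat, depth = queue.popleft()
        pages |= members.get(cat, set())
        if depth < max_depth:
            for child in subcats.get(cat, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return pages

def intersect(cat_a, cat_b, subcats, members, max_depth=2):
    """Intersect the deep-indexed contents of two categories."""
    return (collect_members(cat_a, subcats, members, max_depth)
            & collect_members(cat_b, subcats, members, max_depth))

# Hypothetical toy data: category -> subcategories, category -> direct pages.
subcats = {
    "Actors": ["Actors by nationality"],
    "Actors by nationality": ["German actors"],
    "Germany": ["German people"],
    "German people": ["German actors"],
}
members = {
    "Actors": {"Page B"},
    "Germany": {"Page C"},
    "German actors": {"Page A"},
}

shallow = intersect("Actors", "Germany", subcats, members, max_depth=1)
deep = intersect("Actors", "Germany", subcats, members, max_depth=2)
```

In this toy tree the shared subcategory sits two levels down, so a depth of 1
finds nothing and a depth of 2 finds the one page. Every extra level multiplies
the number of categories visited by the branching factor, which is where the
exponential cost comes from.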
In short, multichill's super-category lookup works for one image. To perform a
proper category intersection you'd have to perform a super-category lookup for
_every_ image on Commons, for each intersection!
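
For comparison, a rough sketch of what an intersection built on upward lookups
would look like. This is not multichill's code; the names, the dictionaries
and the sample data are hypothetical and only illustrate that the lookup has
to run once per image.

```python
def ancestors(start_cats, parents, max_depth):
    """Walk upward from `start_cats`; return every category reached within `max_depth` levels."""
    seen = set(start_cats)
    frontier = set(start_cats)
    for _ in range(max_depth):
        frontier = {p for c in frontier for p in parents.get(c, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen

def intersect_by_lookup(image_cats, parents, wanted, max_depth=3):
    """Run one super-category lookup per image; keep images that reach all `wanted` categories."""
    wanted = set(wanted)
    return {
        image
        for image, cats in image_cats.items()   # one lookup for every image
        if wanted <= ancestors(cats, parents, max_depth)
    }

# Hypothetical toy data: category -> parent categories, image -> direct categories.
parents = {
    "German actors": ["Actors by nationality", "German people"],
    "Actors by nationality": ["Actors"],
    "German people": ["Germany"],
}
image_cats = {
    "Image A": ["German actors"],
    "Image B": ["Actors"],
    "Image C": ["Germany"],
}

result = intersect_by_lookup(image_cats, parents, ["Actors", "Germany"])
```

Each single lookup is cheap because the tree branches little on the way up,
but the loop runs over the whole image collection for every intersection
request.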