[Commons-l] Complete brokenness of categories is making me mad.
Gregory Maxwell
gmaxwell at gmail.com
Tue Aug 14 06:32:17 UTC 2007
In short, our category system is completely screwed up.
I've complained about this in the past, but it only seems to be getting worse.
If you follow the category hierarchy you'll find over 26,000 sub*cats*
of Category:GFDL, for example.
The mean number of descendants of a Commons category is 110
subcategories after full expansion.
There are 29 categories from which you can reach at least 90% of the
total categories used on commons.
Just a list of all the categories and their children is over 400 MB of text.
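To make "full expansion" concrete, here is a minimal sketch of how descendant counts like the ones above could be computed. It is not the code behind my numbers; the `children` mapping, the function names and the toy data are all made up for illustration. The only real point is that the walk has to remember what it has already seen, because the category graph contains loops.

```python
from collections import deque

def descendants(children, root):
    """Return every subcategory reachable from root (root itself excluded)."""
    seen = set()
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in children.get(cat, ()):
            # Skip anything already visited so that loops terminate.
            if child not in seen and child != root:
                seen.add(child)
                queue.append(child)
    return seen

def mean_descendants(children):
    """Mean descendant count over every category named in the mapping."""
    cats = set(children)
    for kids in children.values():
        cats.update(kids)
    if not cats:
        return 0.0
    return sum(len(descendants(children, c)) for c in cats) / len(cats)

# Hypothetical toy hierarchy (parent -> list of subcategories), with one loop.
toy = {
    "GFDL": ["Flowers", "Maps"],
    "Flowers": ["White flowers"],
    "White flowers": ["Grey flowers", "Light white flowers"],
    "Maps": ["GFDL"],  # loop back to the top
}

print(len(descendants(toy, "GFDL")))  # 5
print(mean_descendants(toy))          # 2.5
```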
Limiting the depth of traversal doesn't work because doing so would
result in terrible brokenness, such as randomly failing to return a
particular flower picture when someone searches for "flowers" just
because someone decided to split Category:White Flowers into "Grey
flowers" and "light white flowers" thus moving some of the images a
level too deep.
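A small sketch of that failure mode, assuming a hypothetical `children` mapping (category to subcategories) and `members` mapping (category to image names). The file name and the depth limit of 1 are invented; the effect is the same at any fixed limit.

```python
def images_under(children, members, root, max_depth):
    """Collect images in root and in subcategories at most max_depth levels down."""
    found = set(members.get(root, ()))
    seen = {root}
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for cat in frontier:
            for child in children.get(cat, ()):
                if child not in seen:
                    seen.add(child)
                    found.update(members.get(child, ()))
                    next_frontier.append(child)
        frontier = next_frontier
    return found

# Before the split: the image sits one level below "Flowers".
children_before = {"Flowers": ["White flowers"]}
members_before = {"White flowers": ["Daisy.jpg"]}

# After the split: the same image is now two levels below "Flowers".
children_after = {
    "Flowers": ["White flowers"],
    "White flowers": ["Grey flowers", "Light white flowers"],
}
members_after = {"Light white flowers": ["Daisy.jpg"]}

print(images_under(children_before, members_before, "Flowers", 1))  # {'Daisy.jpg'}
print(images_under(children_after, members_after, "Flowers", 1))    # set()
```

Nothing about the image changed, yet the same search with the same limit no longer finds it.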
Many of the broken linkages aren't especially deep.
This is brokenness which goes far beyond the semantic drift issues
that I've argued make implicit category assignment fundamentally
flawed. (Some of it is semantic drift, though: cat:astronomical objects
includes everything on Earth through a completely useless but piecewise
sensible path.)
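One way to find such paths is a plain breadth-first search that records how each category was reached. This is only a sketch: the chain in `toy` is invented to show the shape of the problem and is not the actual path on Commons, and `shortest_path` is a hypothetical helper name.

```python
from collections import deque

def shortest_path(children, start, target):
    """Return the shortest chain of categories from start to target, or None."""
    parents = {start: None}  # category -> the category it was reached from
    queue = deque([start])
    while queue:
        cat = queue.popleft()
        if cat == target:
            path = []
            while cat is not None:
                path.append(cat)
                cat = parents[cat]
            return path[::-1]
        for child in children.get(cat, ()):
            if child not in parents:
                parents[child] = cat
                queue.append(child)
    return None

# Invented example chain: every single step looks reasonable on its own.
toy = {
    "Astronomical objects": ["Planets"],
    "Planets": ["Earth"],
    "Earth": ["Geography"],
    "Geography": ["Cities"],
}

print(" > ".join(shortest_path(toy, "Astronomical objects", "Cities")))
```

Printing the chain makes it easy to see which single link is the one worth breaking.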
What I need to know is: What do you need from me in order to fix this?
I can provide, via JS insertion of data from a toolserver tool, a list
on the category page of all the children of that category. However,
with many cats having over 1000 descendants, you might be waiting a
while for some pages to load.
Alternatively, I could provide, via the same means, a list of the
children of a category at the top of the category page (working around
the subcats display complaint in bugzilla).
I can provide reports such as what are the deepest linkages which
carry the most children, and which categories have the most
descendants.
What do we need to fix this? I've already given my solution (apply all
categories that make sense to images, and only use the hierarchy to
help with finding and suggesting cats).
Some background:
I recently put up a tool which enables very fast intersections of
commons categories. The same back end can be used for blindingly fast
geographic searches, and text searches.
http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py
It's already a useful tool for finding images, but it's substantially
limited by the fact that categories are broken into tiny groups and
the tool cannot dynamically walk the category hierarchy.
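For anyone unfamiliar with the idea, a category intersection is just a set intersection over the images in each category. The sketch below is not how cattersect.py is implemented; it assumes a hypothetical in-memory `index` from category name to a set of image names, and it only looks at direct members, which is exactly the limitation described above.

```python
def intersect(index, categories):
    """Return the images that are direct members of every listed category."""
    sets = [index.get(cat, set()) for cat in categories]
    if not sets:
        return set()
    # Start from the smallest set so the working result stays small.
    sets.sort(key=len)
    result = set(sets[0])
    for other in sets[1:]:
        result &= other
        if not result:
            break  # nothing left to narrow down
    return result

# Hypothetical index: category name -> images tagged directly with it.
index = {
    "Flowers": {"Daisy.jpg", "Rose.jpg", "Lily.jpg"},
    "White": {"Daisy.jpg", "Lily.jpg", "Snow.jpg"},
    "GFDL": {"Lily.jpg", "Rose.jpg"},
}

print(sorted(intersect(index, ["Flowers", "White"])))  # ['Daisy.jpg', 'Lily.jpg']
```

An image filed only under a subcategory of "Flowers" would be missed here unless the hierarchy is walked or the category is applied to the image directly.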
I've long argued that we need to avoid the category hierarchy issue by
applying all applicable categories directly to the image, and reserve
the hierarchy for pure category maintenance purposes.
I've realized that despite the merits of my position (argued
elsewhere), it simply isn't going to get traction in our community.
So in the interest of making progress I started working on finding a
solution to making the tool support following the category hierarchy.
I've got something which I think will work fairly well, ... it doesn't
break live update for image data, though it does require batch updates
of the category tree data. ... I would have had it up tonight were our
category data not so badly broken.