In short, our category system is completely screwed up.
I've complained about this in the past, but it only seems to be getting worse.
If you follow the category hierarchy you'll find over 26,000 sub*cats* of Category:GFDL, for example.
On average, a commons category has 110 descendant subcategories after full expansion.
There are 29 categories from which you can reach at least 90% of the total categories used on commons.
Just a list of all the categories and their children is over 400 MB of text.
Limiting the depth of traversal doesn't work because doing so would result in terrible brokenness, such as randomly failing to return a particular flower picture when someone searches for "flowers" just because someone decided to split Category:White Flowers into "Grey flowers" and "light white flowers" thus moving some of the images a level too deep.
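To illustrate the depth problem described above, here is a minimal sketch of a breadth-first walk over a category tree with an optional depth limit. The `children` mapping and the toy category names are assumptions made up for this example; this is not the code of any existing tool.

```python
from collections import deque

def descendants(children, root, max_depth=None):
    """Return every category reachable from root.

    children  -- dict mapping a category name to a list of its subcategories
                 (hypothetical in-memory form of the category tree)
    max_depth -- if given, stop following links below this many levels
    """
    seen = {root}              # also protects against cycles in the hierarchy
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue           # too deep: do not expand this category further
        for child in children.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    seen.discard(root)
    return seen

# Toy hierarchy after someone splits "White flowers" into two subcategories.
children = {
    "Flowers": ["White flowers"],
    "White flowers": ["Grey flowers", "Light white flowers"],
}

shallow = descendants(children, "Flowers", max_depth=1)
full = descendants(children, "Flowers")
```

With a limit of one level, `shallow` holds only "White flowers", so images that were moved into the two new subcategories silently drop out of a search for "Flowers". The unlimited walk in `full` still finds them, but on the real data that is the walk which explodes.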
Many of the broken linkages aren't especially deep.
This is brokenness which goes far beyond the semantic drift issues that I've argued make implicit category assignment fundamentally flawed (although some of it is through semantic drift: cat:astronomical objects includes everything on earth through a completely useless but piecewise sensible path).
What I need to know is: What do you need from me in order to fix this?
I can provide, via JS insertion of data from a toolserver tool, a list on the category page of all the descendants of that category. However, with many cats having over 1000 descendants... you might be waiting a while for some pages to load.
Alternatively I could provide, via the same means, a list of the children of a category at the top of the category (working around the subcats display complaint in bugzilla).
I can provide reports such as what are the deepest linkages which carry the most children, and which categories have the most descendants.
What do we need to fix this? I've already given my solution (apply all categories that make sense to images, and only use the hierarchy to help with finding and suggesting cats).
Some background:
I recently put up a tool which enables very fast intersections of commons categories. The same back end can be used for blindingly fast geographic searches, and text searches. http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py
It's already a useful tool for finding images, but it's substantially limited by the fact that categories are broken into tiny groups and the tool can not dynamically walk the category hierarchy.
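For readers who have not used the tool, the basic idea of a category intersection can be sketched as set intersection over an index from category name to image identifiers. This is only an illustration: the index layout, the function name `intersect`, and the sample data are assumptions, and nothing here describes how the cattersect back end is actually built.

```python
def intersect(index, categories):
    """Return the images that carry every one of the given categories.

    index      -- dict mapping category name to a set of image identifiers
                  (hypothetical in-memory index)
    categories -- iterable of category names to intersect
    """
    # Start from the smallest set so the intermediate result stays small.
    sets = sorted((index.get(c, set()) for c in categories), key=len)
    if not sets:
        return set()
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
        if not result:
            break              # nothing left, no point in continuing
    return result

# Made-up sample data.
index = {
    "Roses": {"img1", "img2", "img3"},
    "Yellow flowers": {"img2", "img3", "img4"},
    "Featured pictures": {"img3", "img5"},
}

hits = intersect(index, ["Roses", "Yellow flowers", "Featured pictures"])
```

The sketch also shows the limitation mentioned above: an image filed only under a subcategory of "Roses" is not in the "Roses" set, so it never appears in `hits` unless the hierarchy is walked first.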
I've long argued that we need to avoid the category hierarchy issue by applying all applicable categories directly to the image, and reserve the hierarchy for pure category maintenance purposes.
I've realized that despite the merits of my position (argued elsewhere), it simply isn't going to get traction in our community.
So in the interest of making progress I started working on finding a solution to making the tool support following the category hierarchy. I've got something which I think will work fairly well, ... it doesn't break live update for image data, though it does require batch updates of the category tree data. ... I would have had it up tonight were our category data not so badly broken.
On 8/14/07, Gregory Maxwell gmaxwell@gmail.com wrote:
In short, our category system is completely screwed up.
I've complained about this in the past, but it only seems to be getting worse.
I don't think (for once) that this can be solved by throwing code at it. As Brianna has written time and again, we need something that is both category tree and tag cloud, somehow.
Qualities such a system would need, IMHO:
- easy assignment by users
- ability to be queried for general properties ("flower")
- ability to be queried for specific subsections ("yellow rose")
So, here is my suggestion of the day:
- users can assign tags (single words or phrases) to an image
- users can create implications of tags
- implications are fulfilled by the software automatically
Example: I add the tag "MIG-29" to an image. Someone has said (or will say) that "MIG-29" implies "military aircraft", "russian aircraft", "supersonic aircraft". These, in turn, imply "aircraft" (each of them). All these tags will be added to the tag list of this image automatically (not in real time, but through an updating process in the background). Likewise, if I add an implication to a tag, all images carrying this tag will be updated automatically.
This would effectively push the "category flattening" from search time to creation time. It will also generate a s**tload of tags for each image, but "implied" tags could be hidden from view (though not from search) by default. So, your image would be part of queries for "aircraft", "military aircraft", or "russian aircraft". The more parameters you add, the more precise the query gets. The intersection could probably be done by your tool, with minor adaptations, very quickly as well.
We could even use the existing category system to fill the "implication rules", so all that work was not in vain.
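The implication idea above amounts to computing a transitive closure over the implication rules when tags are stored, rather than when they are searched. A minimal sketch, assuming the rules live in a plain dict (the function name `expand_tags` and the data layout are made up for illustration; no such feature exists in the software):

```python
def expand_tags(direct_tags, implies):
    """Return the direct tags plus everything they imply, transitively.

    direct_tags -- tags a user put on the image
    implies     -- dict mapping a tag to the tags it implies (hypothetical rules)
    """
    expanded = set(direct_tags)
    stack = list(direct_tags)
    while stack:
        tag = stack.pop()
        for implied in implies.get(tag, ()):
            if implied not in expanded:   # each tag is followed once, so cycles are safe
                expanded.add(implied)
                stack.append(implied)
    return expanded

# The MIG-29 example from the proposal.
implies = {
    "MIG-29": ["military aircraft", "russian aircraft", "supersonic aircraft"],
    "military aircraft": ["aircraft"],
    "russian aircraft": ["aircraft"],
    "supersonic aircraft": ["aircraft"],
}

tags = expand_tags({"MIG-29"}, implies)
implied_only = tags - {"MIG-29"}   # these could be hidden from view by default
```

A background job could rerun this for every image carrying a tag whenever that tag's implications change, which is the batch update the proposal describes.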
Magnus
On 8/14/07, Magnus Manske magnusmanske@googlemail.com wrote: [snip]
So, here is my suggestion of the day:
- users can assign tags (single words or phrases) to an image
- users can create implications of tags
- implications are fulfilled by the software automatically
Example: I add the tag "MIG-29" to an image. Someone has said (or will say) that "MIG-29" implies "military aircraft", "russian aircraft", "supersonic aircraft". These, in turn, imply "aircraft" (each of them). All these tags will be added to the tag list of this image automatically (not in real time, but through an updating process in the background). Likewise, if I add an implication to a tag, all images carrying this tag will be updated automatically.
[snip]
Your suggestion is similar to what I've been advocating for a while. Perhaps your particular version will gain the traction mine has failed to gain.
Perhaps instead of 'implications' being hard assignments, they could be used in a way that lets users reject 'implications' which don't apply. This could be accomplished through a kind of negative entry like [[-Category:foo]]. Obviously the correct thing to do would be to only use implications in the right places, but there are always exceptions... it would be a shame to introduce something which couldn't handle them.
It seems your suggestion includes the concept of creating a new kind of thing, while I've previously proposed that we just repurpose categories for this application.
I don't much care if the solution uses categories, or some new syntax and foolinks table. ... it's just that categories are readily available for this purpose. Really it's all the same to me.
The only compelling reason I can see to use something other than categories for this is that we don't currently have real category redirects, which would be useful to allow a single unique tag to have representations in multiple languages. If providing yet another syntax is what it takes to get adoption then I'm all for it.
I have an existing local tool which recommends categories based on the parent categories on the image as well as the categories which are on images which are also in the existing categories or on gallery pages common to the image it is suggesting for. It would be pretty easy to convert that into a webapp, possibly even an ajax popup that inserts the new selections directly into the site.
On 8/14/07, Gregory Maxwell gmaxwell@gmail.com wrote:
Your suggestion is similar to what I've been advocating for a while. Perhaps your particular version will gain the traction mine has failed to gain.
Sorry to re-invent parts of your wheel :-)
Perhaps instead of 'implications' being hard assignments, they could be used in a way that lets users reject 'implications' which don't apply. This could be accomplished through a kind of negative entry like [[-Category:foo]]. Obviously the correct thing to do would be to only use implications in the right places, but there are always exceptions... it would be a shame to introduce something which couldn't handle them.
Hmmm... [[Category:-foo]] sounds better, and wouldn't involve special syntax, only magic for categories starting with "-", which shouldn't affect any existing categories.
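One way the "-" prefix could interact with implications is sketched below: a category whose name starts with "-" is read as a rejection, and a rejected tag is neither added nor followed. The function name `effective_tags`, the rule format, and the choice not to follow a rejected tag are all assumptions for this example, not something the thread settles.

```python
def effective_tags(page_categories, implies):
    """Expand implications, skipping any tag the page rejects with a '-' prefix.

    page_categories -- category names as written on the page, e.g. ["MIG-29", "-foo"]
    implies         -- dict mapping a tag to the tags it implies (hypothetical rules)
    """
    rejected = {c[1:] for c in page_categories if c.startswith("-")}
    direct = [c for c in page_categories if not c.startswith("-")]

    result = set()
    stack = list(direct)
    while stack:
        tag = stack.pop()
        if tag in rejected or tag in result:
            continue               # rejected tags are not added and not followed
        result.add(tag)
        stack.extend(implies.get(tag, ()))
    return result

implies = {
    "MIG-29": ["military aircraft", "russian aircraft", "supersonic aircraft"],
    "military aircraft": ["aircraft"],
    "russian aircraft": ["aircraft"],
    "supersonic aircraft": ["aircraft"],
}

# A page that carries MIG-29 but rejects one implied tag (made-up case).
tags = effective_tags(["MIG-29", "-supersonic aircraft"], implies)
```

Note that "aircraft" is still reached through the other two implied tags. Whether a rejection should also block tags reachable only through the rejected one is a design question; this sketch blocks them.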
It seems your suggestion includes the concept of creating a new kind of thing, while I've previously proposed that we just repurpose categories for this application.
I don't much care if the solution uses categories, or some new syntax and foolinks table. ... it's just that categories are readily available for this purpose. Really it's all the same to me.
The only compelling reason I can see to use something other than categories for this is that we don't currently have real category redirects, which would be useful to allow a single unique tag to have representations in multiple languages. If providing yet another syntax is what it takes to get adoption then I'm all for it.
I was thinking along the lines of a system /not/ based on the wiki text, as categories are. Something that can be handled without editing the page. For example, I wrote a JavaScript that allows for adding/editing categories visually separate from text editing (well, almost). However, this breaks as soon as a category is transcluded through a template. There lies a source of potential coding nightmares ;-)
I have an existing local tool which recommends categories based on the parent categories on the image as well as the categories which are on images which are also in the existing categories or on gallery pages common to the image it is suggesting for. It would be pretty easy to convert that into a webapp, possibly even an ajax popup that inserts the new selections directly into the site.
Adding new categories to a page is easy enough, altering/removing them programmatically is when your hair starts to fall out ;-)
Magnus
On 8/14/07, Magnus Manske magnusmanske@googlemail.com wrote:
Sorry to re-invent parts of your wheel :-)
Gads, don't apologise. It's just evidence that I'm not alone in my insanity. Your proposal is distinct from my prior rants and I was glad to see it.
I was thinking along the lines of a system /not/ based on the wiki text, as categories are. Something that can be handled without editing the page. For example, I wrote a JavaScript that allows for adding/editing categories visually separate from text editing (well, almost). However, this breaks as soon as a category is transcluded through a template. There lies a source of potential coding nightmares ;-)
[snip]
Adding new categories to a page is easy enough, altering/removing them programmatically is when your hair starts to fall out ;-)
So I know this pain... it was a driving reason why on enwp we changed the policy so that all templates signifying non-free media both had to begin with "Non-free" and could never be applied to the page indirectly through another template which did not begin with the words 'Non-free'. (I.e. the image description page must match the regexp "{{[Nn]on-free" if any of those templates are displayed on the page.)
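As an illustration of why that naming rule helps tools, the check reduces to a single regular expression search over the page's wikitext. A minimal sketch in Python, with the braces escaped for clarity; the helper name and the sample wikitext strings are made up for this example:

```python
import re

# Same pattern as the regexp quoted above, with the literal braces escaped.
NON_FREE_TEMPLATE = re.compile(r"\{\{[Nn]on-free")

def has_direct_non_free_template(wikitext):
    """True if the raw wikitext directly uses a template starting with 'Non-free'."""
    return NON_FREE_TEMPLATE.search(wikitext) is not None

# Illustrative page texts, not taken from real pages.
direct_use = "== Licensing ==\n{{Non-free logo}}"
lowercase_use = "{{non-free fair use in|Some article}}"
indirect_use = "{{Logo}}"   # would only pull in a non-free template indirectly

results = [
    has_direct_non_free_template(direct_use),
    has_direct_non_free_template(lowercase_use),
    has_direct_non_free_template(indirect_use),
]
```

Because the policy forbids indirect application, a tool can trust the raw text and never has to expand templates to find out whether a page is tagged as non-free.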
Months of work and a million edits later, it's done http://en.wikipedia.org/w/index.php?title=Special%3APrefixindex&from=Non... and the tools work much better now. ;)
In the case of image categories, I think the inability to remove a template applied category via the pretty interface would not be a big deal. Taking categories external to the wikitext would leave us a problem of figuring out how to revision control their changes, which I'm pretty confident is a mandatory feature. The obvious way to revision control them would be to append them to the Wikitext, which would put us back to where we are mostly but with some fancy sugar to hide the internals. ;)
On 8/14/07, Gregory Maxwell gmaxwell@gmail.com wrote:
I recently put up a tool which enables very fast intersections of commons categories. The same back end can be used for blindingly fast geographic searches, and text searches. http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py
Any plan to integrate this kind of functionality directly into MW? I think that would significantly affect the way in which categories are used.
"Gregory Maxwell" gmaxwell@gmail.com wrote on Tue, 14 Aug 2007 02:32:17 -0400:
I recently put up a tool which enables very fast intersections of commons categories. The same back end can be used for blindingly fast geographic searches, and text searches. http://tools.wikimedia.de/~gmaxwell/cgi-bin/cattersect.py
What about indexing or using Catscan (http://meta.wikimedia.org/wiki/CatScan)? I know it doesn't support images right now, but we could maybe talk Düsentrieb into that ...
Regards,
Flo