I had hoped that it would spur interest in adopting or changing to some tag system, but instead I mostly got people complaining that it didn't pick up subcategories. I pointed out that (1) picking up subcategories is technically infeasible in this kind of fast indexing scheme (imagine you move a huge second-level category into another category: now the database must make millions of expensive updates) and that
Just to complete the list of tools that do this, let me point to http://toolserver.org/~daniel/WikiSense/CatScan.php.
(2) our current system produces utter nonsense if you flatten the categories (a subject I've posted on separately several times).
Mostly (2) got answers suggesting various heuristics like "only traverse N deep", which also fails, but in more subtle ways (punishing deep categorization by 'losing' that content, and still producing nonsense results, just less often and less offensively), and most importantly does not solve (1).
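To make the "only traverse N deep" failure concrete, here is a minimal Python sketch. The category names and the `expand` helper are hypothetical, made up for illustration; this is not how any existing tool is implemented.

```python
from collections import deque

def expand(subcats, root, max_depth=None):
    """Collect root plus every subcategory reachable within max_depth levels.

    subcats maps a category name to a list of its direct subcategories.
    max_depth=None means "no limit" (full expansion).
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # the depth limit stops the walk here
        for child in subcats.get(cat, ()):
            if child not in seen:  # also guards against category loops
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

# Hypothetical toy hierarchy, not real Commons data.
subcats = {
    "Animals": ["Mammals"],
    "Mammals": ["Dogs"],
    "Dogs": ["Dog breeds"],
    "Dog breeds": ["Poodles"],
}

shallow = expand(subcats, "Animals", max_depth=2)
full = expand(subcats, "Animals")
lost = full - shallow  # categories the depth limit silently drops
```

With a limit of 2, everything filed under "Dog breeds" or deeper simply disappears from the result, which is the "punishing deep categorization" effect. The limit also does nothing about (1): the expansion still has to be walked or kept up to date.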
Indeed, as is evident there. I think the only good way to get around this would be to use a faceted categorization scheme.
I think we've reached a point where many technical people here have thought about this problem for a long time (years in some cases) and over and over again have concluded that we cannot have fast lookup/union/intersection/differencing tools for categories *AND* have the tree auto-expanded in the results.
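For comparison, this is roughly what the fast half of that trade-off looks like when labels are flat. It is a sketch only: the file names, the tags, and the `build_index` helper are invented for illustration and do not describe any deployed index.

```python
from collections import defaultdict

def build_index(page_tags):
    """Invert {page: tags} into {tag: set of pages}."""
    index = defaultdict(set)
    for page, tags in page_tags.items():
        for tag in tags:
            index[tag].add(page)
    return index

# Hypothetical pages and tags.
page_tags = {
    "File:A.jpg": {"dog", "beach"},
    "File:B.jpg": {"dog", "snow"},
    "File:C.jpg": {"cat", "beach"},
}
index = build_index(page_tags)

dogs_on_beach = index["dog"] & index["beach"]   # intersection
dog_or_cat = index["dog"] | index["cat"]        # union
dogs_not_beach = index["dog"] - index["beach"]  # difference
```

Each query touches only the sets for the tags it names. Re-tagging one page changes a handful of sets, whereas moving a category in an auto-expanded tree would change the result sets of every ancestor.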
True, and I have not come up with a good way either, but perhaps we should look some more into what SMW does - it does support this kind of thing, right? How does it scale? And wasn't Magnus working on some intersection thingy?
If Commons were 1/20th its size and not growing at a good rate, then yes, we could. But since the fully expanded category tree will not fit in RAM even on my 32 GB system, it just cannot work. The usefulness of categories as a lookup tool for users is highly limited, for purely technical reasons if nothing else.
Actually, keeping in RAM the structure of, say, a million categories, where each is itself in three categories, means three million pairs of IDs, each ID four bytes wide; that's 24MB. Not so terrible. I have actually done that to run a cycle detection on the category graphs, and it's quite fast (with Java, anyway). But it's not fast enough to handle deep intersection of categories for dozens if not hundreds of requests per second, as would have to be expected for Wikipedia.
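The arithmetic and the cycle check can be sketched as follows. This is an illustrative Python version, not the Java code mentioned above; `edge_bytes` and `has_cycle` are hypothetical names, and the graph is assumed to be given as parallel lists of (child, parent) category IDs.

```python
def edge_bytes(categories, parents_each, id_bytes=4):
    """Size of the raw edge list: one (child, parent) ID pair per edge."""
    return categories * parents_each * 2 * id_bytes

def has_cycle(num_nodes, children, parents):
    """Iterative three-colour depth-first search over child -> parent edges.

    children[i] and parents[i] together form edge i; IDs are 0..num_nodes-1.
    Returns True if any category is (transitively) its own ancestor.
    """
    adj = [[] for _ in range(num_nodes)]
    for child, parent in zip(children, parents):
        adj[child].append(parent)

    WHITE, GREY, BLACK = 0, 1, 2        # unvisited, on current path, finished
    colour = bytearray(num_nodes)       # all WHITE initially
    for start in range(num_nodes):
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(adj[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if colour[nxt] == GREY:
                    return True         # back edge: nxt is on the current path
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(adj[nxt])))
                    break               # descend into nxt first
            else:
                colour[node] = BLACK    # all neighbours explored
                stack.pop()
    return False

estimate = edge_bytes(1_000_000, 3)                # 24_000_000 bytes
acyclic = has_cycle(3, [0, 1], [1, 2])             # 0 -> 1 -> 2
cyclic = has_cycle(3, [0, 1, 2], [1, 2, 0])        # 0 -> 1 -> 2 -> 0
```

The search is written with an explicit stack because a recursive version could exceed Python's recursion limit on deep category chains. Note that the Python lists used here take considerably more memory than the 24MB raw figure; that number only holds for packed fixed-width IDs.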
Otherwise, we should consider further significant technical improvements to the existing category system to be technically non-viable.
Hm... how fast is the growth of Wikipedia in relation to the rate at which computing power is increasing? My impression is that the latter is coming along faster, so more complex operations become feasible with time.
Regardless of the non-technical arguments against categories the above should be sufficient reason for us to adopt a tagging system. Just so we can build really good search tools. It could run in parallel with the category system, as the category system isn't completely worthless and it will take a long time to get tags applied to everything.
I think that would be terrible. We would have two messes instead of one. Two systems that do kind of the same thing, but don't interact in a meaningful way. Endless arguments, more of the gallery vs. category stuff. Ugh. No. Let's find *one* good way.
-- daniel