On Fri, Sep 19, 2008 at 8:34 AM, Daniel Schwen lists@schwen.de wrote:
Unfortunately you are comparing Apples to Apple Orchards. And maybe I wasn't making myself clear enough. Finding common super-categories is actually not the biggest problem. That direction branches less than going _down_ in the tree.
[snip]
I just wanted to step up and point out that Daniel Schwen is correct.
I created an example tool some time back (perhaps some of you remember it?) which provided instant (i.e. like Google search, <10ms computation times) results for arbitrary Commons category intersections, differences, and unions.
The system worked by taking a dump of the Commons categories, treating them like tags, and indexing them with a special inverted index which is very good at those kinds of set operations. Because it was using the existing categories, the results were not super-useful.
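For readers who have not seen one, here is a minimal sketch of the kind of inverted index described above. It is not the code of the original tool: the function name, the file names, and the categories are all made up for illustration, and a real index over a Commons dump would use compressed posting lists rather than plain Python sets.

```python
from collections import defaultdict

def build_index(file_categories):
    """Map each category (treated as a flat tag) to the set of files carrying it."""
    index = defaultdict(set)
    for filename, categories in file_categories.items():
        for category in categories:
            index[category].add(filename)
    return index

# Hypothetical sample data standing in for a category dump.
file_categories = {
    "Eiffel_Tower_night.jpg": {"Paris", "Towers", "Night photography"},
    "Louvre_pyramid.jpg": {"Paris", "Museums"},
    "Tokyo_Tower.jpg": {"Tokyo", "Towers", "Night photography"},
}

index = build_index(file_categories)

# Each query is a single set operation on precomputed posting sets.
paris_and_towers = index["Paris"] & index["Towers"]   # intersection
paris_not_towers = index["Paris"] - index["Towers"]   # difference
paris_or_tokyo = index["Paris"] | index["Tokyo"]      # union
```

Note that only directly assigned categories are indexed; nothing in this structure knows about subcategories, which is exactly the limitation discussed next.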
I had hoped that it would spur interest in adopting or changing to some tag system, but instead I mostly got people complaining that it didn't pick up subcategories. I pointed out that (1) picking up subcategories is technically infeasible in this kind of fast indexing scheme (imagine you move a huge second-level category into another category: now the database must make millions of expensive updates) and that (2) our current system produces utter nonsense if you flatten the categories (a subject I've posted on separately several times).
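To make point (1) concrete, here is a rough sketch of the update cost, under simplifying assumptions that are mine: the categories form a strict tree (the real Commons graph does not), the index is "flattened" so that every category's posting list contains all files in its subtree, and all names and numbers are invented. The count is an upper bound, since some files may already be listed under the new ancestors.

```python
def ancestors(category, parent_of):
    """Return all ancestors of a category, nearest first (assumes a tree, no cycles)."""
    chain = []
    while category in parent_of:
        category = parent_of[category]
        chain.append(category)
    return chain

def subtree_files(category, children_of, direct_files):
    """Collect the files in a category and in all of its descendants."""
    files = set(direct_files.get(category, ()))
    for child in children_of.get(category, ()):
        files |= subtree_files(child, children_of, direct_files)
    return files

def updates_for_move(category, new_parent, parent_of, children_of, direct_files):
    """Upper bound on posting-list insertions when `category` is attached under `new_parent`.

    Every file in the moved subtree must be added to the posting list of the
    new parent and of each of the new parent's ancestors.
    """
    moved = subtree_files(category, children_of, direct_files)
    targets = [new_parent] + ancestors(new_parent, parent_of)
    return len(moved) * len(targets)

# Hypothetical tree: World > Europe > France, and Buildings > Towers.
parent_of = {"Europe": "World", "France": "Europe"}
children_of = {"Buildings": ["Towers"]}
direct_files = {
    "Buildings": [f"b{i}.jpg" for i in range(1000)],
    "Towers": [f"t{i}.jpg" for i in range(500)],
}

cost = updates_for_move("Buildings", "France", parent_of, children_of, direct_files)
```

Here a single edit to one category page touches 1500 files times 3 posting lists. Scale the subtree to millions of files and the same one-line edit becomes millions of index writes.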
Mostly, (2) got answers suggesting various heuristics like "only traverse N deep". That also fails, but in more subtle ways: it punishes deep categorization by 'losing' that content, and it still produces nonsense results, just less often and less offensively. Most importantly, it does not solve (1).
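A small sketch of the "only traverse N deep" heuristic shows how deeply categorized content gets lost. The category chain and file names are hypothetical, and the function is only an illustration of the heuristic as I understand the suggestion, not anyone's actual proposal.

```python
def files_within_depth(category, children_of, direct_files, max_depth):
    """Collect files reachable from `category` by descending at most `max_depth` levels."""
    files = set(direct_files.get(category, ()))
    if max_depth > 0:
        for child in children_of.get(category, ()):
            files |= files_within_depth(child, children_of, direct_files, max_depth - 1)
    return files

# Hypothetical chain: Animals > Mammals > Dogs > Terriers.
children_of = {
    "Animals": ["Mammals"],
    "Mammals": ["Dogs"],
    "Dogs": ["Terriers"],
}
direct_files = {
    "Animals": ["zoo.jpg"],
    "Mammals": ["whale.jpg"],
    "Dogs": ["dog.jpg"],
    "Terriers": ["terrier.jpg"],
}

shallow = files_within_depth("Animals", children_of, direct_files, max_depth=2)
deep = files_within_depth("Animals", children_of, direct_files, max_depth=3)
```

With max_depth=2 the terrier photo silently disappears from "Animals", precisely because someone categorized it carefully. And the traversal still has to happen somewhere, either at query time or at index-update time, so the cost problem in (1) is untouched.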
I think we've reached a point where many technical people here have thought about this problem for a long time (years in some cases) and over and over again have concluded that we cannot have fast lookup/union/intersection/differencing tools for categories *AND* have the tree auto-expanded in the results. If Commons were 1/20th its size and not growing at a good rate, then yes, we could. But since the fully expanded category tree will not fit in RAM even on my 32 GB system, it just cannot work. The usefulness of categories as a lookup tool for users is highly limited, for purely technical reasons if nothing else.
If there is any technical person here who thinks otherwise, please feel free to engage me in a sidebar conversation and I'll either convince you that you're wrong, or I'll gladly eat crow. Otherwise, we should consider further significant technical improvements to the existing category system to be technically non-viable.
Regardless of the non-technical arguments against categories, the above should be sufficient reason for us to adopt a tagging system, just so we can build really good search tools. It could run in parallel with the category system, as the category system isn't completely worthless and it will take a long time to get tags applied to everything.